List of multi-byte identifiers

I was looking into multi-byte characters and how they are used, but how many different identifiers/patterns are used for the different multi-byte notations?

e.g.: &nbsp;, &#160;, U+0026, %20

How many different identifiers, such as &, &#, U+, %, etc., are there?

I'm trying to inspect inputs: if they contain words longer than 255 characters, it's probably a multi-byte hack attempt. I can then check whether the word can be split on a multi-byte identifier and, if so, stop the hack attempt.

%.. format - a URL-encoded value for embedding into URLs, e.g. %20 is a space (ASCII 0x20, decimal 32)
&nbsp; - a named character entity, a non-breaking space in this case
U+0026 - a Unicode code point in hex notation, an & in this case
&#...; - a numbered character entity in decimal (base 10): &#38; = &
&#x...; - a numbered character entity in hex (base 16): &#x26; = &
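
To make the notations above concrete, here is a minimal sketch that decodes one sample of each back to the character it stands for. It assumes PHP 7.2+ with the mbstring extension (for `mb_chr`); the variable names are purely illustrative.

```php
<?php
// Flags and charset used for every entity decode below.
$flags = ENT_QUOTES | ENT_HTML5;

// URL (percent) encoding: %20 is the byte 0x20, a plain space.
$urlEncoded = rawurldecode('%20');

// Named entity: &nbsp; is U+00A0, which is two bytes (C2 A0) in UTF-8.
$namedEntity = html_entity_decode('&nbsp;', $flags, 'UTF-8');

// Numeric entities: decimal 38 and hex 26 both mean "&".
$decimalEntity = html_entity_decode('&#38;', $flags, 'UTF-8');
$hexEntity     = html_entity_decode('&#x26;', $flags, 'UTF-8');

// "U+0026" is only a written convention for a code point, not an encoding
// that a decoder understands, so the hex digits are parsed by hand here.
$codePoint = mb_chr(hexdec(substr('U+0026', 2)), 'UTF-8');
```

Note that only the non-breaking space ends up as more than one byte; the other samples are ordinary single-byte ASCII characters written in an escaped form.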

Are you trying to avoid homoglyph-based spoofing? Does "identifier" mean username here?

If yes, and if your users use a Latin alphabet, just allow only ASCII letters and digits:

$identifier = preg_replace('#[^A-Za-z0-9]+#', '', $identifier);
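
As a rough illustration of what that filter does, the sketch below runs it on a made-up spoofed name, "paypal" with its first "a" replaced by the look-alike Cyrillic letter U+0430. The input and variable names are hypothetical, and the `preg_match` variant is an alternative I am assuming may suit some cases, not part of the answer above.

```php
<?php
// Hypothetical input: looks like "paypal", but the second letter is Cyrillic U+0430.
$spoofed = "p\u{0430}ypal";

// The filter from above: every byte that is not an ASCII letter or digit is dropped,
// so both bytes of the Cyrillic letter disappear.
$cleaned = preg_replace('#[^A-Za-z0-9]+#', '', $spoofed);

// Alternative sketch: reject the input instead of silently altering it.
$isValid = preg_match('#\A[A-Za-z0-9]+\z#', $spoofed) === 1;
```

Stripping changes the name the user typed (here it becomes "pypal"), so rejecting the input and asking for a different name may be the less surprising choice.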