After converting my site to use UTF-8, I'm now faced with the prospect of validating all incoming UTF-8 data to ensure it's valid and coherent.
There seem to be various regexps and PHP APIs for detecting whether a string is UTF-8, but the ones I've seen seem incomplete (for example, regexps that validate UTF-8 but still allow invalid third bytes).
I'm also concerned about detecting (and preventing) overlong encodings, meaning ASCII characters encoded as multibyte UTF-8 sequences.
Any suggestions or links welcome!
mb_check_encoding() is designed for this purpose:
mb_check_encoding($string, 'UTF-8');
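To see how this behaves on the cases raised in the question, here is a minimal sketch. The helper name is_valid_utf8() and the sample byte strings are my own illustration, not part of any library; it assumes the mbstring extension is loaded. On current PHP versions I would expect the overlong, truncated and bad-third-byte samples to be reported as invalid, but it is worth running against your own PHP build.

```php
<?php
// Hypothetical helper name; a thin wrapper around mb_check_encoding().
// Requires the mbstring extension.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

// Illustrative samples only.
$samples = [
    'plain ASCII'        => "hello",
    'two-byte sequence'  => "caf\xC3\xA9",   // "cafe" with an acute accent on the e
    'overlong slash'     => "\xC0\xAF",      // "/" encoded in two bytes
    'truncated sequence' => "\xE2\x82",      // first two bytes of the euro sign
    'bad third byte'     => "\xE2\x82\x28",  // third byte is not a continuation byte
];

foreach ($samples as $label => $bytes) {
    printf("%-20s %s\n", $label, is_valid_utf8($bytes) ? 'valid' : 'invalid');
}
```

Rejecting invalid input outright is usually safer than trying to repair it, since a repaired string may no longer mean what the sender intended.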
You can use iconv in a couple of ways to tell whether a sequence is valid UTF-8.
Converting from UTF-8 to UTF-8 and comparing the result with the input:
$str = "\xfe\x20"; // Invalid UTF-8
// iconv() returns false (and raises a notice, suppressed here by @) on invalid input
$conv = @iconv('UTF-8', 'UTF-8', $str);
if ($str !== $conv) {
    print("Input was not a valid UTF-8 sequence.\n");
}
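Asking for the length of the string in characters (iconv_strlen() counts characters, not bytes, and returns false if the input is not valid in the given charset):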
$str = "\xfe\x20"; // Invalid UTF-8
if (@iconv_strlen($str, 'UTF-8') === false) {
    print("Input was not a valid UTF-8 sequence.\n");
}
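Since the question is about validating all incoming data, the same iconv_strlen() check could be applied to every string in a request array. The following is only a sketch under my own assumptions: find_invalid_utf8() is a hypothetical helper name, the dotted path format is arbitrary, and array keys are not checked. How strict iconv is (for instance about overlong encodings) depends on the iconv implementation PHP was built against.

```php
<?php
// Hypothetical helper: returns the dotted paths of all string values in a
// (possibly nested) input array that iconv does not accept as UTF-8.
// Array keys themselves are not validated in this sketch.
function find_invalid_utf8(array $input, string $prefix = ''): array
{
    $bad = [];
    foreach ($input as $key => $value) {
        $path = $prefix === '' ? (string) $key : $prefix . '.' . $key;
        if (is_array($value)) {
            // Recurse into nested arrays such as tags[]=...
            $bad = array_merge($bad, find_invalid_utf8($value, $path));
        } elseif (is_string($value) && @iconv_strlen($value, 'UTF-8') === false) {
            $bad[] = $path;
        }
    }
    return $bad;
}

// Illustrative input; in a real script this might be $_GET or $_POST.
$input = [
    'name' => "Zo\xC3\xAB",          // valid two-byte sequence
    'tags' => ['ok', "\xfe\x20"],    // second tag is invalid UTF-8
];

foreach (find_invalid_utf8($input) as $path) {
    print("Invalid UTF-8 in field: $path\n");
}
```

Running the check once at the point where input enters the application means the rest of the code can assume well-formed strings.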