In an attempt to fight some spam, I'm looking for a way to find out whether a string contains any Chinese or Cyrillic characters.
I have checked the character ranges in UTF-8 at http://en.wikipedia.org/wiki/UTF-8 , but I cannot work out how to use them in PHP.
What I'd really like to do is count the number of characters that fall in the Cyrillic ranges or the Chinese ranges. Can this be done with a regex?
You can check the code point of each character for inclusion in a specific Unicode range. Here is a list of Unicode ranges: http://jrgraphix.net/research/unicode_blocks.php
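A minimal sketch of that range check, assuming the input is valid UTF-8 and the mbstring extension is available (mb_ord needs PHP 7.2 or later). The helper name count_in_ranges and the two range tables are made up for illustration; the block boundaries are the standard Cyrillic, Cyrillic Supplement and CJK Unified Ideographs blocks.

```php
<?php
// Hypothetical helper: counts the characters of $str whose code point
// falls inside any of the [low, high] ranges in $ranges.
function count_in_ranges(string $str, array $ranges): int
{
    // Split into characters; the /u flag makes PCRE read the input as UTF-8.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return 0; // input was not valid UTF-8
    }

    $count = 0;
    foreach ($chars as $char) {
        $cp = mb_ord($char, 'UTF-8');
        foreach ($ranges as [$lo, $hi]) {
            if ($cp >= $lo && $cp <= $hi) {
                $count++;
                break; // count each character at most once
            }
        }
    }
    return $count;
}

// Assumed range tables, taken from the Unicode block list.
$cyrillic = [[0x0400, 0x04FF], [0x0500, 0x052F]]; // Cyrillic, Cyrillic Supplement
$cjk      = [[0x4E00, 0x9FFF]];                   // CJK Unified Ideographs

echo count_in_ranges("Привет, world", $cyrillic), "\n"; // 6
echo count_in_ranges("你好 world", $cjk), "\n";          // 2
```

Chinese text can also use characters outside the main CJK block (extensions, punctuation), so the table may need more ranges depending on the spam you see.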
You can easily check whether a string is valid UTF-8 with:
mb_check_encoding($inputString, "UTF-8");
Just watch out: it seems to be buggy in PHP 5.2.0 through 5.2.6.
The documentation page for mb_check_encoding may also have what you want, specifically in the user comments. Adapting the comment by javalc6 at gmail dot com to your case:
function check_utf8($str) {
    $count = 0; // Number of bytes that are not valid UTF-8
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        if ($c > 127) { // non-ASCII byte: must start a multibyte sequence
            $bytes = 0;
            if ($c > 247) { // 0xF8-0xFF never start a sequence
                ++$count;
                continue;
            } else if ($c > 239) {
                $bytes = 4;
            } else if ($c > 223) {
                $bytes = 3;
            } else if ($c > 191) {
                $bytes = 2;
            } else { // 0x80-0xBF: continuation byte without a lead byte
                ++$count;
                continue;
            }
            if (($i + $bytes) > $len) { // sequence is cut off by end of string
                ++$count;
                continue;
            }
            while ($bytes > 1) { // remaining bytes must be 0x80-0xBF
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) {
                    ++$count;
                }
                $bytes--;
            }
        }
    }
    return $count;
}
Although I honestly didn't check it.
In PHP, preg_match_all returns the number of full pattern matches.
Try
$n = preg_match_all('/\p{Cyrillic}/u', $text);
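or, to match by Unicode block rather than by script (PCRE does not support block names such as \p{InCyrillic}, so give the Cyrillic and Cyrillic Supplement code point ranges explicitly):
$n = preg_match_all('/[\x{0400}-\x{052F}]/u', $text);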
For more information on using Unicode in regular expressions, read this article.
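The same approach should cover the Chinese half of the question through the Han script property. Below is a rough sketch; count_script is a hypothetical wrapper, not a built-in, and it assumes the script name passed in is one PCRE knows (such as Cyrillic or Han).

```php
<?php
// Hypothetical wrapper: number of characters in $text that belong to
// the given Unicode script. Returns 0 if $text is not valid UTF-8.
function count_script(string $text, string $script): int
{
    $n = preg_match_all('/\p{' . $script . '}/u', $text);
    return $n === false ? 0 : $n;
}

$text = "Купить дешево 你好";

echo count_script($text, 'Cyrillic'), "\n"; // 12
echo count_script($text, 'Han'), "\n";      // 2
```

For spam filtering, comparing these counts against the total length (for example via preg_match_all('/./us', $text)) gives a ratio that is less sensitive to a single stray character than a plain yes/no test.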
Found a nice solution here: https://devdojo.com/blog/tutorials/php-detect-if-non-english
Use this code:
function is_english($str)
{
    // Equal byte lengths mean no multibyte character was collapsed.
    return strlen($str) == strlen(utf8_decode($str));
}
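It works because utf8_decode converts each multibyte UTF-8 character to a single byte, so the byte length changes whenever the string contains a non-ASCII character. Note that this also rejects accented Latin text such as "café", and that utf8_decode is deprecated as of PHP 8.2.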
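A sketch of an equivalent check that avoids utf8_decode, assuming the input is valid UTF-8: any byte outside the 7-bit ASCII range means the string contains a multibyte character. The name is_ascii is made up here to describe what the test really does.

```php
<?php
// Hypothetical replacement: true when $str contains only 7-bit ASCII bytes.
// The pattern has no /u flag, so it is matched byte by byte.
function is_ascii(string $str): bool
{
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

var_dump(is_ascii("hello"));   // bool(true)
var_dump(is_ascii("Привет"));  // bool(false)
```

Like the original, this only tells you that something non-ASCII is present, not which script it belongs to.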