UTF-8, digits, and regular expressions

This is what I've found in the Kohana3 validator rules:

public static function digit($str, $utf8 = FALSE)
{
    if ($utf8 === TRUE)
    {
        return (bool) preg_match('/^\pN++$/uD', $str);
    }
    else
    {
        return (is_int($str) AND $str >= 0) OR ctype_digit($str);
    }
}

Can someone give an example where passing the $utf8 parameter as true versus false gives different results (to be precise: false positives for $utf8 == false)?

From what I remember, digits are ASCII-safe characters, and no other UTF-8 character can be confused with them.

PS: To be even more specific: is it possible to fool this check by passing something that, read as UTF-8, does not look like a number but still passes the check with $utf8 == false?

I just gave the second part of your question a bit more alcohol, and my conclusion is that you can't hide an ASCII digit in a UTF-8 sequence. Digits must be 0x30..0x39, that is, in the bit range 00110000..00111001.

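Multibyte UTF-8 sequences are built from bytes with prefixes such as

 1110xxxx  10xxxxxx  10xxxxxx

Every byte of such a sequence has its high bit set, while ASCII digits start with 0011. Therefore the ASCII representation of a digit can't match at any position: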

 00110000 
 ▲▲        00110000  ▼
           ▲         00110000

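So a byte inside a valid multibyte UTF-8 sequence can never be mistaken for an ASCII digit. A string that passes the Latin-1/ASCII check consists solely of the bytes for 0-9, and it cannot decode to anything else in /u mode. Ignoring invalid encodings, of course.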
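To make the bit argument concrete, here is a minimal PHP sketch. The helpers byte_bits() and contains_ascii_digit_byte() are hypothetical names made up for this illustration; they are not part of Kohana or PHP. The sketch assumes the ctype extension is available and writes the sample character as raw bytes so the file encoding doesn't matter.

```php
<?php
// Hypothetical helper: list the bytes of a string as 8-bit binary strings.
function byte_bits($str)
{
    $bits = array();
    foreach (str_split($str) as $byte)
    {
        $bits[] = str_pad(decbin(ord($byte)), 8, '0', STR_PAD_LEFT);
    }
    return $bits;
}

// Hypothetical helper: TRUE if any byte lies in the ASCII digit range 0x30..0x39.
function contains_ascii_digit_byte($str)
{
    foreach (str_split($str) as $byte)
    {
        $code = ord($byte);
        if ($code >= 0x30 && $code <= 0x39)
        {
            return TRUE;
        }
    }
    return FALSE;
}

// U+0660 ARABIC-INDIC DIGIT ZERO, written as its two raw UTF-8 bytes.
$arabic_zero = "\xD9\xA0";

print_r(byte_bits($arabic_zero));                  // 11011001 and 10100000
var_dump(contains_ascii_digit_byte($arabic_zero)); // bool(false)
var_dump(ctype_digit($arabic_zero));               // bool(false)
```

Both bytes start with a 1 bit, so neither can fall into 0x30..0x39, and the byte-oriented ctype_digit() rejects the string.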

Even though 0-9 are ASCII-safe, there are many other numeric characters in Unicode.

See Unicode Characters in the 'Number, Decimal Digit' Category for a list. Some examples are U+0660 ARABIC-INDIC DIGIT ZERO (٠) and U+1D7EC MATHEMATICAL SANS-SERIF BOLD DIGIT ZERO (𝟬)

...etc.
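
As a rough way to try this out, the sketch below copies the rule from the question into a standalone function and feeds it a few strings. The free function digit() is only for experimenting (in Kohana it is a static method), and the sample labels are made up. It assumes PHP is built with PCRE Unicode property support and the ctype extension; the non-ASCII samples are written as raw UTF-8 bytes.

```php
<?php
// Standalone copy of the rule quoted in the question, for experimenting only.
function digit($str, $utf8 = FALSE)
{
    if ($utf8 === TRUE)
    {
        // \pN matches any Unicode "Number" character, not just 0-9.
        return (bool) preg_match('/^\pN++$/uD', $str);
    }
    else
    {
        // ASCII-only check: non-negative integers or strings made of 0-9.
        return (is_int($str) AND $str >= 0) OR ctype_digit($str);
    }
}

$samples = array(
    'ASCII 123'           => "123",
    'Arabic-Indic 123'    => "\xD9\xA1\xD9\xA2\xD9\xA3", // U+0661 U+0662 U+0663
    'Vulgar fraction 1/2' => "\xC2\xBD",                 // U+00BD
);

foreach ($samples as $label => $value)
{
    printf("%-20s utf8=FALSE: %-5s utf8=TRUE: %s\n",
        $label,
        var_export(digit($value), TRUE),
        var_export(digit($value, TRUE), TRUE));
}
```

Only the first sample should pass with $utf8 = FALSE, while all three should pass with $utf8 = TRUE. The difference runs in one direction: the UTF-8 branch accepts more input, and the ASCII branch produces no false positives.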