preg_match and UTF-8

Let's say I have the following:

$str1 = "via Tokyo";
$str2 = "via 東京";

I want to match any non-whitespace characters after the "via ". Normally I'd use the following:

preg_match("/via\s(\S+)/", $str2, $match);

to obtain the matching characters. I assumed this wouldn't work with the above due to preg_match not understanding utf8, however it works perfectly in this case.

Is this working correctly because preg_match is simply looking for bytes that aren't whitespace, and if so, am I safe to use this for any UTF8 characters?

PS I'm aware that I should really be using the mb_ereg functions for this (or avoiding PHP altogether) but I'm looking for a better understanding of why this works. Thanks!

Yes. UTF-8 encodes every non-ASCII character as a multi-byte sequence, and it guarantees that every byte in such a sequence has its high bit set (0x80 or above), so none of them can be confused with an ASCII byte. Searching for a slash, backslash or space will therefore never give a false positive inside a multi-byte sequence.
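To make that guarantee concrete, here is a small sketch that dumps the bytes of the question's second string and keeps only those at 0x80 or above. The variable names are mine, and the `\u{...}` escapes assume PHP 7.0 or later; they are used so the result does not depend on how the file is saved.

```php
<?php
// "via " followed by U+6771 and U+4EAC (the two characters from the question).
$str = "via \u{6771}\u{4EAC}";

// Collect every byte with the high bit set, i.e. every non-ASCII byte.
$highBytes = [];
foreach (str_split($str) as $byte) {
    if (ord($byte) >= 0x80) {
        $highBytes[] = sprintf("%02X", ord($byte));
    }
}

echo implode(" ", $highBytes), "\n"; // prints: E6 9D B1 E4 BA AC
```

All six bytes of the two characters land in the list and none of the four ASCII bytes of "via " do, which is why a byte-wise search for an ASCII character cannot hit the middle of a multi-byte sequence.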

It's working because the individual bytes that make up 東 and 京 happen not to be whitespace characters in the single-byte character set. Among other things, your regex would happily accept U+2003 (em space) as a non-whitespace match, despite it being a whitespace character.
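A quick way to see the em space case, as a sketch: it assumes PHP 7.0+ (for the `\u{...}` escape) and the default "C" locale, since without the u modifier the meaning of \s for bytes above 0x7F can depend on the locale.

```php
<?php
// "via " followed by a single em space (U+2003, bytes E2 80 83 in UTF-8).
$subject = "via \u{2003}";

// No u modifier: the subject is scanned byte by byte.
$matched = preg_match('/via\s(\S+)/', $subject, $match);

echo $matched, "\n";           // prints: 1
echo bin2hex($match[1]), "\n"; // prints: e28083
```

The pattern reports a match and captures the three bytes of the em space, because none of them is an ASCII whitespace byte.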

Try adding the u modifier after the closing delimiter (for example "/via\s(\S+)/u") so that preg_match treats both the pattern and the subject as UTF-8.
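As a rough illustration of what the modifier changes, the sketch below captures a single \S with and without u. The variable names are mine, and the `\u{...}` escapes assume PHP 7.0 or later.

```php
<?php
// "via " followed by U+6771 and U+4EAC (the string from the question).
$subject = "via \u{6771}\u{4EAC}";

preg_match('/via\s(\S)/', $subject, $bytewise);  // no u: \S consumes one byte
preg_match('/via\s(\S)/u', $subject, $charwise); // u: \S consumes one character

echo bin2hex($bytewise[1]), "\n"; // prints: e6
echo bin2hex($charwise[1]), "\n"; // prints: e69db1

// The original pattern with u added still captures both characters.
preg_match('/via\s(\S+)/u', $subject, $match);
echo $match[1] === "\u{6771}\u{4EAC}" ? "ok" : "mismatch", "\n"; // prints: ok
```

Without u the single \S returns only the first byte of the first character, which is not valid UTF-8 on its own; with u it returns the whole three-byte character.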