In the original code (a Drupal core module), a previous developer commented out this line:
if (preg_match('/[^\x{80}-\x{F7} a-z0-9@_.\'-]/i', $name)) {
and added this one instead:
if (preg_match('/[^\x{80}-\x{F7} a-z0-9@_.\'-]/iu', $name)) {
Can you help me understand the difference between these two? What does the u modifier do? In the PHP docs I found:
u (PCRE8)
This modifier turns on additional functionality of PCRE that is incompatible with Perl. Pattern strings are treated as UTF-8. This modifier is available from PHP 4.1.0 or greater on Unix and from PHP 4.2.3 on win32. UTF-8 validity of the pattern is checked since PHP 4.3.5.
So I guess the previous developer had problems with the interpretation of special characters, or something similar. I'm a bit puzzled; please advise.
The modifier is needed to process UTF-8 encoded input properly. A pattern like \xC1 should match the Unicode character U+00C1 (Á). When you encode Á in UTF-8 you get the two bytes \xC3\x81, so without the modifier \xC1 is compared against single bytes and doesn't match. The "u" modifier makes the engine treat the pattern and the subject as UTF-8, so it does match.
Basically, when you work with UTF-8 encoded text, this is what happens:
<?php
var_dump(preg_match('/\xC1/u', 'Á'));
// => int(1), matches
var_dump(preg_match('/\xC1/', 'Á'));
// => int(0), doesn't match
?>
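To see why the byte-wise comparison fails, it can help to look at the encoded bytes directly. The following is only a small sketch, assuming the script file itself is saved as UTF-8; the variable names are made up for illustration.

```php
<?php
// Sketch only; assumes this file is saved as UTF-8.
$char = 'Á'; // U+00C1

// UTF-8 encoding of the character, as hex: two bytes, C3 and 81.
$hex = bin2hex($char);

// Without u the dot matches one byte at a time; with u it matches one code point.
$bytes_matched  = preg_match_all('/./', $char, $m1);
$points_matched = preg_match_all('/./u', $char, $m2);

echo "$hex, without u: $bytes_matched, with u: $points_matched\n";
```

Neither of the two bytes is \xC1, which is why the pattern without the modifier finds nothing.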
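In your case, the first regular expression, [^\x80-\xF7 ...], matches no non-ASCII UTF-8 encoded text at all, because every byte of a multi-byte UTF-8 sequence falls inside the range \x80-\xF7 and is therefore excluded by the negated class. The second expression matches Unicode characters outside the range U+0080 to U+00F7, so it also matches Cyrillic, Greek, Arabic, Hebrew, ... characters, which the first expression lets through.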
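A quick way to check this is to run both patterns against a few names. This is a minimal sketch, assuming the file is saved as UTF-8; the sample names and variable names are hypothetical and not taken from Drupal.

```php
<?php
// Sketch only; assumes this file is saved as UTF-8.
$without_u = '/[^\x{80}-\x{F7} a-z0-9@_.\'-]/i';
$with_u    = '/[^\x{80}-\x{F7} a-z0-9@_.\'-]/iu';

// Hypothetical sample names, not taken from Drupal.
$names = array('john.doe', 'José', 'Иван', 'bad#name');

$results = array();
foreach ($names as $name) {
    // 1 means the name contains a character outside the allowed class.
    $results[$name] = array(
        preg_match($without_u, $name),
        preg_match($with_u, $name),
    );
    printf("%s: without u = %d, with u = %d\n", $name, $results[$name][0], $results[$name][1]);
}
```

Only the Cyrillic name should differ between the two patterns: its bytes all lie in \x80-\xF7, but its code points (U+0418 and so on) lie outside U+0080 to U+00F7.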
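The u modifier means the pattern and the subject are treated as UTF-8 strings instead of single bytes (as with ISO-8859-1), so classes such as \w are not limited to ASCII characters like A-Z.
For example: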
$what = 'łódka - русский алфавит';
// With the u modifier, \w also matches non-ASCII letters such as ł or р.
if ( preg_match_all('#([\w A-Za-z])#u', $what, $res) ) :
    echo 'match: ' . $what;
endif;
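To make the effect on \w visible, the same pattern can be counted with and without the modifier. This is only a sketch, assuming a UTF-8 source file, the default "C" locale, and a PHP build where the u modifier also enables Unicode character properties for \w (which I believe is the case in current PHP versions); the variable names are illustrative.

```php
<?php
// Sketch only; assumes a UTF-8 source file and the default "C" locale.
$word = 'łódka';

// Without u, \w is applied byte by byte and only sees ASCII word characters.
$without_u = preg_match_all('/\w/', $word, $m1);

// With u, \w is applied per code point and also accepts ł and ó.
$with_u = preg_match_all('/\w/u', $word, $m2);

echo "without u: $without_u, with u: $with_u\n";
echo implode(' ', $m2[0]) . "\n";
```

With the modifier, each of the five letters is matched as one character; without it, only the plain ASCII letters d, k and a are matched.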