I'm trying to handle Polish characters with preg_match, but something is going wrong.
These are my attempts:
Without the u modifier:
preg_match("@^[0-9A-ZĄąĆćĘęŁłÓóŻżŹźŃńŚś\-\.\, ]{5,35}$@i", $valuesId)
With the u modifier:
preg_match("@^[0-9A-ZĄąĆćĘęŁłÓóŻżŹźŃńŚś\-\.\, ]{5,35}$@iu", $valuesId)
But words like Żółkiewski, Zielona Góra or Równina do not pass.
Does anybody know how to handle it correctly without changing server settings?
Are these characters really multi-byte?
As shown by this online demo, the following code prints 1 (meaning the pattern matched) three times:
$regex = "@^[0-9A-ZĄąĆćĘęŁłÓóŻżŹźŃńŚś., -]{5,35}$@i";
// Each call prints 1 when the subject matches the pattern.
echo preg_match($regex, "Żółkiewski") . "\n";
echo preg_match($regex, "Zielona Góra") . "\n";
echo preg_match($regex, "Równina") . "\n";
Therefore the problem is not with the regex itself, but with a mismatch between the encoding of the script file that contains the regex and the encoding of the input fed to it. It may well be, for instance, that your script is saved in one of the Windows or ISO Eastern European encodings, in which case these characters may not be multi-byte at all. Many IDEs and editors can convert a text file's encoding.
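To make that kind of mismatch visible, here is a minimal sketch that feeds the same word to the pattern twice, once as UTF-8 and once converted to ISO-8859-2. It assumes the script file itself is saved as UTF-8 and that the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// Sketch: a UTF-8 pattern tested against UTF-8 input and against ISO-8859-2 input.
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.

$regex = "@^[0-9A-ZĄąĆćĘęŁłÓóŻżŹźŃńŚś., -]{5,35}$@iu";

$utf8Input = "Żółkiewski";
// Simulate input arriving in a legacy single-byte encoding.
$latin2Input = mb_convert_encoding($utf8Input, 'ISO-8859-2', 'UTF-8');

// Same text, different bytes: 13 bytes in UTF-8, 10 bytes in ISO-8859-2.
$utf8Bytes   = strlen($utf8Input);
$latin2Bytes = strlen($latin2Input);

// With the u modifier, a subject that is not valid UTF-8 makes preg_match()
// return false rather than 0, and preg_last_error() reports the reason.
$utf8Result   = preg_match($regex, $utf8Input);
$latin2Result = preg_match($regex, $latin2Input);
$latin2Error  = preg_last_error();

var_dump($utf8Bytes, $latin2Bytes, $utf8Result, $latin2Result);
var_dump($latin2Error === PREG_BAD_UTF8_ERROR);
```

Checking for a false return value (as opposed to 0) together with preg_last_error() is a quick way to tell "the text does not fit the pattern" apart from "the bytes are not UTF-8 at all".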
The best choice for future-proofing is to make sure every component of your system talks UTF-8: the script files themselves, the input they receive, the output they send, and so on. Addressing how to achieve all of those is a topic for a book chapter and beyond the scope of this question.
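As one small piece of that, input can be normalised to UTF-8 before it reaches the regex. The following is only a sketch: to_utf8() is a hypothetical helper, and the ISO-8859-2 fallback is an assumption about where non-UTF-8 data would come from.

```php
<?php
// Sketch: normalise incoming text to UTF-8 at the boundary of the application.
// to_utf8() is a hypothetical helper; the ISO-8859-2 fallback is an assumption.

function to_utf8(string $value, string $legacyEncoding = 'ISO-8859-2'): string
{
    // Valid UTF-8 (which includes plain ASCII) is passed through untouched.
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }
    // Otherwise treat the bytes as the assumed legacy encoding.
    return mb_convert_encoding($value, 'UTF-8', $legacyEncoding);
}

$regex = "@^[0-9A-ZĄąĆćĘęŁłÓóŻżŹźŃńŚś., -]{5,35}$@iu";

// Simulated legacy input: "Zielona Góra" encoded as ISO-8859-2.
$legacy = mb_convert_encoding("Zielona Góra", 'ISO-8859-2', 'UTF-8');

$normalised = to_utf8($legacy);
$matched    = preg_match($regex, $normalised);

var_dump($normalised === "Zielona Góra", $matched);
```

The validity check is a heuristic: some legacy byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding is always better than guessing it.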
Your regex works as expected, but only if the characters are coming through as UTF-8. Are you perhaps working in a system that has its character encoding set to ISO-8859-2 (Central European, ISO Latin 2), which is the ISO standard character set covering Polish characters? Look at this example/debugging code I put together. Note that I experimented with mb_detect_encoding as well as mb_convert_encoding, but it is not clear whether that helps or hurts. Feel free to comment out that part of the code if it gets confusing:
// Set a test array.
$test_array = array();
$test_array[] = 'Żółkiewski';
$test_array[] = 'Zielona Góra';
$test_array[] = 'Równina';

// If the test file exists, use its lines (without trailing newlines) instead.
if (file_exists('zzz_polish.txt')) {
    $test_array = file('zzz_polish.txt', FILE_IGNORE_NEW_LINES);
}

// Set the header for debugging output.
header('Content-Type: text/plain; charset=utf-8');

// Roll through the test array.
foreach ($test_array as $valuesId) {

    // Convert the value to UTF-8 if it is not already valid UTF-8.
    // This has to happen before the match to have any effect on it.
    if (mb_detect_encoding($valuesId, 'UTF-8', true) !== 'UTF-8') {
        $valuesId = mb_convert_encoding($valuesId, 'UTF-8', 'ISO-8859-2');
    }

    // Run a regex to detect Polish UTF-8 characters.
    preg_match("@^[0-9A-ZĄąĆćĘęŁłÓóŻżŹźŃńŚś\-\.\, ]{5,35}$@i", $valuesId, $matches);

    // Dump the matches for debugging.
    print_r($matches);

}
Now if you place that in a UTF-8 encoded text file with a .php extension, the results are as follows:
Array
(
[0] => Żółkiewski
)
Array
(
[0] => Zielona Góra
)
Array
(
[0] => Równina
)
Which is expected. But I have been able to recreate a condition where it fails with data that looks identical on the surface, placed in a file named zzz_polish.txt like this:
Żółkiewski
Zielona Góra
Równina
Now, if I save that file with proper UTF-8 encoding, it works just like the example that uses the test array. But if I simply change the file encoding to UTF-16, the text looks the same to my eyes on the screen, yet the output is as follows:
Array
(
)
Array
(
)
Array
(
)
So my guess is that somewhere in your data chain a text encoding mixup is happening. Your regex works well otherwise.
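One way to track down such a mixup is to look at the raw bytes rather than at the rendered text. A minimal sketch follows, assuming a UTF-8 script file and the mbstring extension; describe_bytes() is a hypothetical debugging helper, not part of the code above.

```php
<?php
// Sketch: inspect the raw bytes of a value to see which encoding it really uses.
// describe_bytes() is a hypothetical debugging helper.

function describe_bytes(string $value): string
{
    return sprintf(
        '%d bytes, valid UTF-8: %s, hex: %s',
        strlen($value),
        mb_check_encoding($value, 'UTF-8') ? 'yes' : 'no',
        bin2hex($value)
    );
}

$utf8 = "Równina";
// The same word as it would be stored in a UTF-16 (little-endian) file.
$utf16 = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

echo describe_bytes($utf8), "\n";   // 8 bytes: "ó" takes two bytes (c3 b3)
echo describe_bytes($utf16), "\n";  // 14 bytes: every character takes two bytes
```

Running the same helper on a line read from the real input (a form field, a database row, a file) shows quickly whether the bytes match what the regex expects.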