I'm using this PHP function for SEO URLs. It works fine with Latin words, but my URLs are in Cyrillic. The regex /[^a-z0-9_\s-]/ does not work with Cyrillic characters. Please help me make it work with non-Latin characters.
function seoUrl($string) {
    // Lower case everything
    $string = strtolower($string);
    // Make alphanumeric (removes all other characters)
    $string = preg_replace('/[^a-z0-9_\s-]/', '', $string);
    // Clean up multiple dashes or whitespaces
    $string = preg_replace('/[\s-]+/', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/', '-', $string);
    return $string;
}
You need to use the Unicode script property for the Cyrillic alphabet, which PHP's PCRE fortunately supports as \p{Cyrillic}. You also have to set the u (Unicode) flag so that the engine treats the pattern and the subject as UTF-8 and behaves predictably. You may also need the i flag for case-insensitive matching, so that A-Z is kept as well:
~[^\p{Cyrillic}a-z0-9_\s-]~ui
You don't need to double escape \s.
PHP code:
preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $string);
To learn more about Unicode Regular Expressions see this article.
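As a quick illustration of the difference, here is a minimal sketch that runs the original ASCII-only class and the \p{Cyrillic} version over the same string. The sample title is made up, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical sample input mixing Cyrillic, Latin, digits and punctuation.
$title = 'Привет, World 2024!';

// ASCII-only class: every Cyrillic letter falls outside it and is removed.
$asciiOnly = preg_replace('~[^a-z0-9_\s-]+~i', '', $title);

// Same class extended with \p{Cyrillic}; the u flag makes PCRE read UTF-8.
$withCyrillic = preg_replace('~[^\p{Cyrillic}a-z0-9_\s-]+~ui', '', $title);

var_dump($asciiOnly);     // Cyrillic word and punctuation are gone
var_dump($withCyrillic);  // only the punctuation is gone
```

If preg_replace returns null with the u flag, the input is most likely not valid UTF-8, so it is worth checking the encoding first.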
\p{L} or \p{Letter} matches any kind of letter from any language.
To match only Cyrillic characters, use \p{Cyrillic}.
Since Cyrillic characters are not ASCII characters, you have to use the u flag/modifier so that the regex engine treats the pattern and the subject as UTF-8.
Be sure to use mb_strtolower instead of strtolower, since you are working with Unicode characters.
Because you convert all characters to lowercase first, you don't need the i regex flag/modifier.
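To see why the lowercasing step and the i flag are interchangeable here, consider this small sketch. It assumes the mbstring extension is installed and the input is UTF-8. The sample string is made up.

```php
<?php
// Hypothetical input: Latin mixed case plus upper-case Cyrillic.
$input = 'Hello МИР';

// strtolower() works byte by byte; with the default C locale (and on
// PHP 8.2+ in general) it only lowercases ASCII letters.
$byteLower = strtolower($input);

// mb_strtolower() is encoding-aware and lowercases the Cyrillic part too.
$mbLower = mb_strtolower($input, 'UTF-8');

// Same class as in the answer, without the i flag.
$pattern = '/[^\p{Cyrillic}a-z0-9\s_-]+/u';

// Without lowercasing first, the upper-case Latin "H" is stripped.
$withoutLowercasing = preg_replace($pattern, '', $input);

// After mb_strtolower() every letter is already inside the class.
$afterLowercasing = preg_replace($pattern, '', $mbLower);

var_dump($byteLower, $mbLower, $withoutLowercasing, $afterLowercasing);
```

Note that \p{Cyrillic} matches upper- and lower-case Cyrillic letters alike, so the i flag only ever matters for the a-z part of the class.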
The following PHP code should work for you:
function seoUrl($string) {
    // Lower case everything (encoding-aware, so Cyrillic is handled too)
    $string = mb_strtolower($string, 'UTF-8');
    // Keep Cyrillic and Latin letters, digits, whitespace, _ and - (removes all other characters)
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Clean up multiple dashes or whitespaces (u flag keeps \s from matching inside UTF-8 sequences)
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Convert whitespaces and underscore to dash
    $string = preg_replace('/[\s_]/u', '-', $string);
    return $string;
}
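A short usage sketch follows. The function is repeated so that the snippet runs on its own, and it uses the u flag on every pattern. The sample titles are made up, and mbstring is assumed to be available.

```php
<?php
function seoUrl($string) {
    // Lower case everything, including Cyrillic letters.
    $string = mb_strtolower($string, 'UTF-8');
    // Remove everything except Cyrillic/Latin letters, digits, whitespace, _ and -.
    $string = preg_replace('/[^\p{Cyrillic}a-z0-9\s_-]+/u', '', $string);
    // Collapse runs of whitespace and dashes into a single space.
    $string = preg_replace('/[\s-]+/u', ' ', $string);
    // Turn each remaining whitespace character or underscore into a dash.
    $string = preg_replace('/[\s_]/u', '-', $string);
    return $string;
}

// Hypothetical page titles.
echo seoUrl('Привет, Мир! Как дела?'), "\n";    // привет-мир-как-дела
echo seoUrl('PHP & Кириллица -- тест_1'), "\n"; // php-кириллица-тест-1
```

Leading or trailing whitespace in the input would end up as a leading or trailing dash, so trimming the string first may be worthwhile.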
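Furthermore, please note that \p{InCyrillic_Supplementary} and \p{InCyrillic} are Unicode block properties: the first covers the Cyrillic Supplement block (U+0500 to U+052F) and the second the basic, non-Supplementary Cyrillic block (U+0400 to U+04FF). They exist in Perl-style regex flavors, but the PCRE documentation lists In... block properties as unsupported, so in PHP use the \p{Cyrillic} script property or explicit code point ranges.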
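A minimal sketch of addressing the two blocks by explicit code point range, which avoids relying on block-property support in the regex engine (PCRE's documentation lists Perl's In... properties as unsupported). The ranges are the Unicode block boundaries, U+0400 to U+04FF and U+0500 to U+052F, and the sample strings are made up.

```php
<?php
// Basic Cyrillic block and Cyrillic Supplement block as code point ranges.
$basicBlock      = '/^[\x{0400}-\x{04FF}]+$/u';
$supplementBlock = '/^[\x{0500}-\x{052F}]+$/u';

$basic      = 'Жук';        // all three letters lie in U+0400..U+04FF
$supplement = "\u{0501}";   // CYRILLIC SMALL LETTER KOMI DE, U+0501

var_dump(preg_match($basicBlock, $basic));           // int(1)
var_dump(preg_match($basicBlock, $supplement));      // int(0)
var_dump(preg_match($supplementBlock, $supplement)); // int(1)

// The \p{Cyrillic} script property spans both blocks.
var_dump(preg_match('/^\p{Cyrillic}+$/u', $basic . $supplement)); // int(1)
```

For slug generation the script property is usually the simpler choice, since it accepts letters from either block.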