I'm trying to find the month in a text written in German (in an HTML file).
March is written "März".
I want to be sure that I catch it, so I check for
Marz, März, M&auml;rz
I tried to use this code:
if(preg_match("/ma?ä?(&auml;)?rz/i", $title))
    return 3;
It works fine for the first two, but not with &auml;. What did I do wrong?
(The HTML and my PHP files are encoded in UTF8)
Why not just try:
(Marz|März|M&auml;rz)
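A minimal sketch of this alternation in PHP, assuming the title may contain either the literal character or the HTML entity and that the script file is saved as UTF-8. The function name `monthFromTitle` is hypothetical, and the `u` modifier is an addition (it is not in the question) so that pattern and subject are treated as UTF-8:

```php
<?php
// Hypothetical helper: returns 3 if the title mentions March, else null.
function monthFromTitle(string $title): ?int
{
    // Each spelling is a complete alternative, so no "?" quantifier
    // ever applies to a single byte of a multi-byte character.
    if (preg_match('/M(?:a|ä|&auml;)rz/iu', $title) === 1) {
        return 3;
    }
    return null;
}

var_dump(monthFromTitle('Termin: 12. M&auml;rz 2011')); // int(3)
```

Comparing the result with `=== 1` also covers the case where `preg_match` returns `false`, which happens with the `u` modifier when the subject is not valid UTF-8.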
If it's just for searching purposes but not for returning the actual position of the word, you could normalize the search string using html_entity_decode() and iconv():
$string = html_entity_decode($string, ENT_QUOTES, "utf-8");
$string = iconv("UTF-8", "ASCII//TRANSLIT//IGNORE", $string);
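// then search for "Marz" (the transliteration of ä depends on the iconv
// implementation and locale; it may come out as "a", "ae" or "?")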
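Because the output of `//TRANSLIT` varies between systems, here is a sketch of the same normalize-then-search idea that swaps iconv for an explicit replacement map. The helper name `normalizeGerman` and the map itself are assumptions made for illustration, not part of the answer above:

```php
<?php
// Hypothetical helper: decode entities, then fold umlauts to ASCII with an
// explicit map (used here instead of iconv's //TRANSLIT for predictability).
function normalizeGerman(string $text): string
{
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // strtr with an array replaces whole strings, so multi-byte keys are safe.
    return strtr($text, [
        'ä' => 'a', 'ö' => 'o', 'ü' => 'u',
        'Ä' => 'A', 'Ö' => 'O', 'Ü' => 'U',
        'ß' => 'ss',
    ]);
}

$normalized = normalizeGerman('Im M&auml;rz und im März');
var_dump(stripos($normalized, 'Marz') !== false); // bool(true)
```

As with the iconv version, positions found in the normalized string do not map back to the original text, so this suits detection only.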
You have to first decode the entities, then use a comparison that works with the Unicode Collation Algorithm. For example, this works in Perl:
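use utf8;   # the string literals below contain non-ASCII characters
use Unicode::Collate;
# level => 1 compares base letters only, ignoring accents and case
my $Collator = Unicode::Collate->new(normalization => undef, level => 1);
my $str = "Ich muß Perl studieren.";
my $sub = "MÜSS";
my $match;
if (my($pos,$len) = $Collator->index($str, $sub)) {
    $match = substr($str, $pos, $len);   # "muß"
}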
Whether strings with and without marks (such as umlauts) match depends on the
comparison level you choose: at level 1 only the base letters are compared.
I do not know how to perform basic Unicode operations like this in PHP, but I figure there must be a corresponding library, given how necessary these kinds of things are.
ä is more than one byte in UTF-8, so the ? applies only to its last byte. You have to group it:
preg_match("/ma?(ä)?(&auml;)?rz/i", $title);
Besides, Keng's approach is better.
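The byte-level behaviour can be checked directly. This is a minimal sketch, assuming the script file is saved as UTF-8; the simplified patterns are illustrative, and the `u` modifier in the last line is an alternative that the answer above does not mention:

```php
<?php
// In UTF-8, "ä" is two bytes (0xC3 0xA4).
var_dump(strlen('ä')); // int(2)

// Without the u modifier, "ä?" means: byte 0xC3, then an optional 0xA4.
// The first byte is still required, so plain "Marz" does not match.
var_dump(preg_match('/maä?rz/i', 'Marz'));   // int(0)

// Grouping makes the whole two-byte sequence optional.
var_dump(preg_match('/ma(ä)?rz/i', 'Marz')); // int(1)

// With the u modifier the pattern is read as UTF-8 characters,
// so "?" applies to the whole "ä" even without a group.
var_dump(preg_match('/maä?rz/iu', 'Marz'));  // int(1)
```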