Consider this simple code:

    echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');

It prints

    `e

instead of just

    e

Do you know what I am doing wrong?

Nothing changed after adding setlocale:

    setlocale(LC_COLLATE, 'en_US.utf8');
    echo iconv('UTF-8', 'ASCII//TRANSLIT', 'è');
I have this standard function to return valid URL strings with the invalid URL characters removed. The magic seems to be in the line after the "// remove unwanted characters" comment.
This is taken from the Symfony framework documentation: http://www.symfony-project.org/jobeet/1_4/Doctrine/en/08 which in turn is taken from http://php.vrana.cz/vytvoreni-pratelskeho-url.php but I don't speak Czech ;-)
    function slugify($text)
    {
        // replace runs of characters that are neither letters nor digits with "-"
        $text = preg_replace('#[^\\pL\d]+#u', '-', $text);
        // trim leading and trailing "-"
        $text = trim($text, '-');
        // transliterate to ASCII (the result depends on the current locale)
        if (function_exists('iconv'))
        {
            $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        }
        // lowercase
        $text = strtolower($text);
        // remove unwanted characters
        $text = preg_replace('#[^-\w]+#', '', $text);
        // compare with '' rather than empty(), which would also reject the slug "0"
        if ($text === '')
        {
            return 'n-a';
        }
        return $text;
    }

    echo slugify('é'); // --> "e"
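When doing transliteration, you have to make sure that your locale is properly set, otherwise the default POSIX locale will be used. Note that with glibc the category that appears to matter for iconv's //TRANSLIT is LC_CTYPE rather than LC_COLLATE (LC_ALL covers both).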
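A minimal sketch of the locale advice above, assuming a glibc-based system where //TRANSLIT follows the LC_CTYPE category. The helper name `translit_ascii` and the candidate locale names are hypothetical; check `locale -a` to see which locales are actually installed. The exact output for accented input still depends on the platform's iconv implementation, so none is shown here.

```php
<?php
// Hypothetical helper: transliterate UTF-8 text to ASCII with a UTF-8
// LC_CTYPE locale active, then restore the previous locale.
// Returns the transliterated string, or false if iconv() fails.
function translit_ascii($text, array $locales = ['en_US.UTF-8', 'en_US.utf8', 'C.UTF-8'])
{
    // Passing '0' only queries the current setting without changing it.
    $previous = setlocale(LC_CTYPE, '0');

    // setlocale() tries each candidate in order and returns false if none
    // is available; in that case the current locale stays in effect.
    setlocale(LC_CTYPE, $locales);

    $result = @iconv('UTF-8', 'ASCII//TRANSLIT', $text);

    // Put the original LC_CTYPE back so other code is not affected.
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);
    }

    return $result;
}

var_dump(translit_ascii('è'));
```

Restoring the previous locale keeps the change local to this one call, which matters because setlocale() affects the whole process.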
I'm tempted to say "nothing", although this is a little outside my expertise. PHP's iconv() is notorious, and has inspired many workarounds.
Read the comments on the iconv() documentation for more inspiration. (Or commiseration. Too close to call.)
It seems the standard way to handle this is with a "remove accents" function, which you can find in libraries like Flourish or CMSs like WordPress. iconv seems to be unable to strip accents (and rightly so), since this isn't a good idea for anything other than URL slugs.
Cf. @tchrist: this can be done with the PHP intl extension
http://fr2.php.net/manual/en/book.intl.php

    preg_replace('/\pM*/u','',normalizer_normalize( $mystring, Normalizer::FORM_D));
eéèêëiîïoöôuùûüaâäÅ Ἥ ŐǟǠ ǺƶƈƉųŪŧȬƀ␢ĦŁȽŦ ƀǖ becomes
eeeeeiiiooouuuuaaaA Η OaA AƶƈƉuUŧOƀ␢ĦŁȽŦ ƀu
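The one-liner above can be wrapped in a small function. This is only a sketch: it assumes the intl extension is loaded, and the name `strip_marks` is made up for illustration. It decomposes the text (NFD) and then removes the combining marks, so characters without a decomposition are left untouched.

```php
<?php
// Hypothetical helper: remove accents by decomposing to NFD and dropping
// every combining mark (Unicode category M). Requires ext-intl.
function strip_marks(string $text): string
{
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // Normalization failed (for example on invalid UTF-8): return the input as-is.
        return $text;
    }

    // \p{M}+ matches runs of combining marks; /u enables UTF-8 mode.
    // preg_replace() returns null on error, so fall back to the input.
    return preg_replace('/\p{M}+/u', '', $decomposed) ?? $text;
}

echo strip_marks("\u{00E8}"), "\n"; // è (U+00E8) decomposes to e + U+0300, leaving "e"
echo strip_marks("\u{0110}"), "\n"; // Đ (U+0110) has no decomposition and is unchanged
```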
As tchrist emphasises, not all Unicode characters are considered decomposable.
Extracts from the Unicode charts:
U0080.pdf
00CF Ï LATIN CAPITAL LETTER I WITH DIAERESIS
≡ 0049 I 0308 ¨
NB: the symbol « ≡ » indicates an available decomposition.
00D0 Ð LATIN CAPITAL LETTER ETH
→ 00F0 ð latin small letter eth
→ 0110 Đ latin capital letter d with stroke
→ 0189 Ɖ latin capital letter african d
No decomposition available, strangely IMHO (we could consider the ASCII letter D an acceptable equivalent).
U0100.pdf
0110 Đ LATIN CAPITAL LETTER D WITH STROKE
→ 00D0 Ð latin capital letter eth
→ 0111 đ latin small letter d with stroke
→ 0189 Ɖ latin capital letter african d
Even stranger: this one is named LATIN CAPITAL LETTER D (WITH STROKE), yet it is not decomposable as such! Perhaps a cooler solution would be to get the Unicode name of each character, compare it with the name of each ASCII character, and replace accordingly. Anyone? ;-]
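One possible take on the name-comparison idea, as a rough sketch rather than a complete solution. It assumes the intl extension (PHP 7+) for `IntlChar::charName()`; the function names `base_letter_from_name` and `ascii_by_name` are hypothetical. It only handles names of the form "LATIN CAPITAL/SMALL LETTER X" optionally followed by "WITH ...", so characters such as Ð (ETH) or Ɖ (AFRICAN D) pass through unchanged.

```php
<?php
// Hypothetical helper: map a single character to its ASCII base letter
// when its Unicode name is "LATIN (CAPITAL|SMALL) LETTER X [WITH ...]".
function base_letter_from_name(string $char): string
{
    $name = IntlChar::charName($char);
    $pattern = '/^LATIN (CAPITAL|SMALL) LETTER ([A-Z])(?: WITH .+)?$/';

    if (is_string($name) && preg_match($pattern, $name, $m)) {
        return $m[1] === 'SMALL' ? strtolower($m[2]) : $m[2];
    }

    // Any other name (or no name at all): leave the character alone.
    return $char;
}

// Hypothetical helper: apply the mapping to every code point of a string.
function ascii_by_name(string $text): string
{
    // Split into code points; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return implode('', array_map('base_letter_from_name', $chars));
}

echo ascii_by_name("\u{0110}\u{0111}"), "\n"; // Đđ, named "... LETTER D WITH STROKE", gives "Dd"
```

Unlike the normalization approach, this relies on the naming convention of the Unicode character database, so it should be treated as a heuristic.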
It happened to me with pure iconv, without PHP. The trick was to set the LANG environment variable to en_US.UTF-8 (it was hu_HU.UTF-8 before, in my case). After that it worked as expected.
It seems to depend on the PHP version:
    php -version
    PHP 7.0.0RC8 (cli) (built: Nov 25 2015 12:36:50) ( NTS )
    Copyright (c) 1997-2015 The PHP Group
    Zend Engine v3.0.0, Copyright (c) 1998-2015 Zend Technologies
        with Zend OPcache v7.0.6-dev, Copyright (c) 1999-2015, by Zend Technologies
    php -r "var_dump(iconv('UTF-8', 'ASCII//TRANSLIT', 'è'));"
    string(2) "`e"

    php -version
    PHP 7.0.8-1~dotdeb+8.1 (cli) ( NTS )
    Copyright (c) 1997-2016 The PHP Group
    Zend Engine v3.0.0, Copyright (c) 1998-2016 Zend Technologies
        with Zend OPcache v7.0.8-1~dotdeb+8.1, Copyright (c) 1999-2016, by Zend Technologies
    php -r "var_dump(iconv('UTF-8', 'ASCII//TRANSLIT', 'è'));"
    string(1) "e"