I'm trying to import data from CSV files into a web app that uses UTF-8 encoding. I'm using fgetcsv (I don't have to if there's a better way), and utf8_encode to try to translate characters from whatever the file's encoding is. When I call mb_detect_encoding on strings that come out of this particular file, I get 'ASCII'.
There are a few strange characters in the input. utf8_encode deals happily with é characters (where before they were coming out as black diamond question marks). However, it fails to translate double quotes and apostrophes, and instead just removes them.
Help much appreciated, thanks. I'm using CakePHP, in case that gives me some more options!
Edit - I meant utf8_encode, not utf8_decode.
You only need one call to iconv, with the correct charset for the $in_charset parameter:
$utf8Text = iconv($inputCharset, 'UTF-8', $text);
You need to know the input charset. There's no way around it. Make a specification that all input needs to be in ISO-8859-1, or whatever you prefer. Alternatively, find out what the charset of your input is (ask the author, test yourself in an editor, whatever). Alternatively, require that the input needs to specify what encoding it's in somewhere, somehow.
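As a rough sketch of how this fits together with fgetcsv: once the input charset is known, each field can be converted as it is read. The helper name readCsvAsUtf8 and the choice of Windows-1252 for the demo data are assumptions for illustration only; substitute whatever charset your files are actually in.

```php
<?php
// Hypothetical helper: read a CSV file in a known charset and
// return its rows with every field converted to UTF-8.
function readCsvAsUtf8(string $path, string $inputCharset): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $rows = [];
    // Delimiter, enclosure and escape are passed explicitly.
    while (($fields = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        if ($fields === [null]) {
            continue; // fgetcsv reports a blank line as [null]
        }
        $rows[] = array_map(
            fn ($field) => iconv($inputCharset, 'UTF-8', $field),
            $fields
        );
    }
    fclose($handle);

    return $rows;
}

// Demo data: "René" and a curly apostrophe, written as Windows-1252 bytes.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "name,remark\nRen\xE9,What\x92s your name?\n");

$rows = readCsvAsUtf8($path, 'Windows-1252'); // assumed input charset
unlink($path);

echo $rows[1][0], ' / ', $rows[1][1], "\n";
```

Converting field by field after parsing keeps fgetcsv working on the original single-byte data; converting the whole file up front before parsing would be an equally reasonable design.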
Encoding is not black magic. You just need to be aware of what encoding some text is in and what encoding you want it to be in. Then use a function like iconv that can cleanly translate the characters from one encoding to another. utf8_encode and utf8_decode translate between ISO-8859-1 and UTF-8. Their names are chosen terribly, since they suggest they can automagically translate anything from and to UTF-8, but that's not the case.
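To illustrate why é can survive while quotes and apostrophes vanish, here is a small sketch. It assumes the file is really Windows-1252, which is only a guess since the question does not say. In that charset byte 0x92 is a curly apostrophe, whereas in ISO-8859-1 (the only source charset utf8_encode knows) the same byte is an invisible control character. The byte for é, 0xE9, is the same in both.

```php
<?php
// "What’s" with the apostrophe stored as the single byte 0x92.
$raw = "What\x92s";

// Interpreting the byte as ISO-8859-1 (the mapping utf8_encode applies)
// yields U+0092, a control character that browsers do not display.
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $raw);

// Interpreting it as Windows-1252 yields U+2019, the curly apostrophe.
$asCp1252 = iconv('Windows-1252', 'UTF-8', $raw);

echo bin2hex($asLatin1), "\n"; // 57686174c29273
echo bin2hex($asCp1252), "\n"; // 57686174e2809973
```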
You can work around the strange characters by converting every non-ASCII UTF-8 character to a numeric HTML entity with the function below:
function htmlallentities($str){
    // Replace every multi-byte UTF-8 sequence with a numeric HTML entity.
    $res = '';
    $strlen = strlen($str);
    for($i=0; $i<$strlen; $i++){
        $byte = ord($str[$i]);
        if($byte < 128) // 1-byte char (ASCII), copied unchanged
            $res .= $str[$i];
        elseif($byte < 192); // stray continuation byte (invalid UTF-8), skipped
        elseif($byte < 224) // 2-byte char
            $res .= '&#'.((63&$byte)*64 + (63&ord($str[++$i]))).';';
        elseif($byte < 240) // 3-byte char
            $res .= '&#'.((15&$byte)*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
        elseif($byte < 248) // 4-byte char: the lead byte carries only 3 payload bits
            $res .= '&#'.((7&$byte)*262144 + (63&ord($str[++$i]))*4096 + (63&ord($str[++$i]))*64 + (63&ord($str[++$i]))).';';
    }
    return $res;
}
For example, for an apostrophe (') I used the following code snippet:
$value = "What’s your name?";
$value = htmlallentities(utf8_decode($value));
$str = "⿿";
$str2 = "'";
$value = str_replace($str, $str2, $value);
$value = mysql_real_escape_string($value);
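If the mbstring extension is available (an assumption about your setup), the built-in mb_encode_numericentity can do the same entity conversion without a hand-written decoder. This is a minimal sketch, not a drop-in replacement for the snippet above; the conversion map below simply covers every code point from U+0080 upward.

```php
<?php
// Source string is UTF-8; the curly apostrophe is written as explicit bytes.
$value = "What\xE2\x80\x99s your name?";

// Map format: [first code point, last code point, offset, mask].
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];

// Every character in the mapped range becomes a decimal numeric entity.
$encoded = mb_encode_numericentity($value, $convmap, 'UTF-8');

echo $encoded, "\n"; // What&#8217;s your name?
```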
I'll be glad if this helps you.