I am trying to detect the encoding of a given string in order to convert it later on to utf-8 using iconv. I want to restrict the set of source encodings to utf8, iso8859-1, windows-1251, CP437
//...
$acceptedEncodings = array('utf-8',
'iso-8859-1',
'windows-1251'
);
$srcEncoding = mb_detect_encoding($content, $acceptedEncodings, true);
if($srcEncoding)
{
$content = iconv($srcEncoding, 'UTF-8', $content);
}
//...
The problem is thet mb_detect_encoding does not seem to accept CP437 as a supported encoding and when I give it a CP437 encoded string this is classified as iso-8859-1 which causes iconv to ignore characters like ü.
My question is: Is there a way to detect CP437 encoding earlier? The conversion from CP437 to UTF-8 using iconv works fine but I just cannot find the proper way to detect CP437.
Thank you very much.
As has been discussed countless times before: it is fundamentally impossible to distinguish any single-byte encoding from any other single-byte encoding. What you get are a bunch of bytes. In encoding A the byte x42
may map to character X and in encoding B the same byte may map to character Y. But nothing about the blob of bytes you have tells you that, because you only have the bytes. They can mean anything. They're equally valid in all encodings. It's possible to identify more complex multi-byte encodings like UTF-8, since they need to follow more complex internal rules. So it's possible to definitely be able to say This is not valid UTF-8. However, it is impossible to say with 100% certainty This is definitely UTF-8, not ISO-8859.
You need to have meta data about the content you receive which tells you what encoding the content is in. It's not practical to identify it after the fact. You'd need to employ actual content analysis to figure out which encoding a piece of text makes the most sense in.