混合UTF-8和latin1表与PDO

There is an existing database/tables where i cannot change the charset. These tables use the collation "latin1_swedish_ci" but there is UTF-8 data stored inside. For example string "fußball" (german football) is saved as "fußball". That's the part i can not change.

My whole script works just fine with UTF-8 and it's own UTF-8 Tables and i use PDO(mySQL) with an UTF-8 Connection to query. But sometimes i have to query some "old" latin1 tables. Is there any "cool" way for solving this instead of sending SET NAMES.

This is my very first question at stackoverflow! :-)

  1. It's actually very easy to think that data is encoded in one way, when it is actually encoded in some other way: this is because any attempt to directly retrieve the data will result in conversion first to the character set of your database connection and then to the character set of your output medium—therefore you should first verify the actual encoding of your stored data through either SELECT BINARY myColumn FROM myTable WHERE ... or SELECT HEX(myColumn) FROM myTable WHERE ....

  2. Once you are certain that you have UTF-8 encoded data stored within a Windows-1252 encoded column (i.e. you are seeing 0xc39f where the character ß is expected), what you really want is to drop the encoding information from the column and then tell MySQL that the data is actually encoded as UTF-8. As documented under ALTER TABLE Syntax:

    Warning 

    The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:

    ALTER TABLE t1 CHANGE c1 c1 BLOB;
    ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
    

    The reason this works is that there is no conversion when you convert to or from BLOB columns.

  3. Henceforth MySQL will correctly convert selected data to that of the connection's character set, as desired. That is, if a connection uses UTF-8, no conversion will be necessary; whereas a connection using Windows-1252 will receive strings converted to that character set.

  4. Not only that, but string comparisons within MySQL will be correctly performed. For example, if you currently connect with the UTF-8 character set and search for 'fußball', you won't get any results; whereas you would after the modifications above.

  5. The pitfall to which you allude, of having to change numerous legacy scripts, only applies insofar as those legacy scripts are using an incorrect connection character set (for example, are telling MySQL that they use Windows-1252 whereas they are in fact sending and expecting receipt of data in UTF-8). You really should fix this in any case, as it can lead to all sorts of horrors down the road.

I solved it with creating another database handle in my DB class, that uses latin1 so whenever i need to query the "legacy tables" i can use

$pdo    = Db::getInstance();
$pdo->legacyDbh->query("MY QUERY");
# instead of
$pdo->dbh->query("MY QUERY");

if anyone has a better solution that also do not touch the tables.. :-)