How to solve encoding issues on the web [closed]

I have been developing web applications for a year, in an environment of Apache, PHP, HTML5 and JavaScript. Until now I didn't pay attention to encoding issues; it worked like magic. I merely set the MySQL connection to UTF-8 in a configuration file and forgot about it.
But as my experience grew, I ran into cases where I need to fully understand how encoding issues are handled in this environment. Whenever I grasp one piece, I lose track of the others, so I never get the full picture.
To be clear:

  • JavaScript strings are UTF-16.
  • The HTTP transport encoding is arbitrary.
  • PHP strings are binary.
  • The MySQL connection encoding is also arbitrary.

How is the conversion done, and what possible issues do I need to care about?

The issue only really exists at the boundary between two systems. Within one system (PHP, a database, JavaScript etc. individually) there are clear ways to deal with encodings and there's usually little friction. The typical problem is that a blob of binary data is transferred from one system to another, say from PHP to the database, without the accompanying metadata correctly specifying what encoding that blob is in.
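To make the boundary problem concrete, here is a minimal sketch in Python (used only for illustration; the same thing happens between PHP, a database and a browser). The sample word and the variable names are made up. The bytes are identical in both cases; only the assumed encoding differs.

```python
# The sending system holds text and serializes it to bytes as UTF-8.
original = "café"
blob = original.encode("utf-8")  # b'caf\xc3\xa9'

# The receiver gets only the bytes. If it assumes Latin-1 because no
# metadata said otherwise, decoding succeeds but the text is corrupted.
wrong = blob.decode("latin-1")

# If the receiver is told the actual encoding, the text survives intact.
right = blob.decode("utf-8")

print(repr(wrong))  # 'cafÃ©'
print(repr(right))  # 'café'
```

Note that the wrong guess raises no error, which is why this kind of bug tends to surface only when someone looks at the output.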

Different interfaces between systems have different ways of specifying that required metadata. The interface between PHP and the database lets you specify a connection encoding, which makes the database "understand" what encoding PHP is using and what encoding it expects in return. Between the web server/PHP and the browser there are HTTP headers and/or HTML meta tags which allow this metadata to be specified.
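As a rough sketch of what "using the metadata" means on the receiving side, the following Python snippet reads the charset parameter from a Content-Type header value and decodes the body with it. The function name `decode_body` and the UTF-8 fallback are assumptions for this example; real browsers apply more elaborate rules (meta tags, BOM sniffing and so on).

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode a response body using the charset declared in Content-Type.

    `default` is an assumed fallback for when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lowercased charset parameter,
    # or None when the header does not carry one.
    charset = msg.get_content_charset() or default
    return body.decode(charset)


latin1_body = "café".encode("iso-8859-1")
print(decode_body(latin1_body, "text/html; charset=ISO-8859-1"))

utf8_body = "café".encode("utf-8")
print(decode_body(utf8_body, "text/html"))
```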

One system always has to yield. Either one system is sending data in a specific encoding, telling the recipient what encoding that is and the recipient will have to deal with it; or the recipient can specify upfront what encoding it needs and the sender will have to convert it to that encoding before sending.

  • between PHP and most databases, PHP specifies the encoding to be used and the database converts text on the fly to that encoding
  • between web servers/PHP and browsers, the server/PHP declares what encoding the content it's sending is in and the browser has to deal with it
  • when the browser sends data back, the server declares what encoding it expects through the accept-charset attribute of forms, or the browser infers it from the encoding of the page it received
  • JavaScript receives text after the browser has already decoded it, so it doesn't really care about encodings
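The two ways of "yielding" described above can be sketched as follows. This is a toy model in Python, not how any particular database or web server is implemented; `STORAGE_ENCODING`, `fetch` and `send` are hypothetical names.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by going through text."""
    return data.decode(source).encode(target)


# Hypothetical: the sending system keeps its text as Latin-1 internally.
STORAGE_ENCODING = "latin-1"
stored = "naïve".encode(STORAGE_ENCODING)


def fetch(connection_encoding: str) -> bytes:
    """The recipient dictates the encoding; the sender converts on the fly.

    This mirrors the PHP/database pattern with a connection encoding.
    """
    return transcode(stored, STORAGE_ENCODING, connection_encoding)


def send() -> tuple:
    """The sender declares its encoding; the recipient has to deal with it.

    This mirrors the server/browser pattern with a declared charset.
    """
    return stored, STORAGE_ENCODING


# Recipient asked for UTF-8, so it can decode as UTF-8 without guessing.
print(fetch("utf-8").decode("utf-8"))

# Recipient decodes with whatever encoding the sender declared.
payload, declared = send()
print(payload.decode(declared))
```

Either way both sides end up agreeing on one encoding for the bytes on the wire, which is the whole point.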

The HTTP and MySQL encodings are not arbitrary; they are what you tell them to be. So in general, you can make everything UTF-8, and you're fine.

JavaScript strings are only UTF-16 internally. The JS files you send can be UTF-8. If you tell the browser which encoding a response is in, the browser can convert it to whatever encoding it needs. The trouble only starts when you declare a different encoding than the one you actually send.
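To illustrate the difference between the encoding on the wire and the internal representation, this small Python sketch counts UTF-8 bytes and UTF-16 code units for the same text. The helper name `utf16_code_units` is made up; the count it returns corresponds to what JavaScript reports as a string's `length`.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


text = "a😀"  # one ASCII letter plus one character outside the BMP

# On the wire as UTF-8: 1 byte for "a", 4 bytes for the emoji.
print(len(text.encode("utf-8")))  # 5

# Inside a UTF-16 string: 1 unit for "a", a surrogate pair for the emoji.
print(utf16_code_units(text))  # 3

# Decoding the UTF-8 bytes gives back exactly the same text.
print(text.encode("utf-8").decode("utf-8") == text)  # True
```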

Why UTF-8?

  • No single-byte "ANSI" encoding can represent all characters (there's only room for 256 different characters).
  • UTF-8 is usually the most compact Unicode format, especially when you transport Western languages.
  • UTF-8 is the only common Unicode encoding that isn't affected by byte order. UTF-16 and UTF-32 come in big-endian and little-endian variants and may need a byte order mark (BOM) to tell them apart, although that is also something you don't usually have to worry about.
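The three points above can be checked with a few lines of Python. This is only a sketch; the sample strings and the helper `can_encode` are arbitrary choices for the example.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# 1. A single-byte encoding cannot hold every character.
print(can_encode("€", "latin-1"))  # False
print(can_encode("€", "utf-8"))    # True

# 2. For mostly-ASCII text, UTF-8 is the most compact Unicode encoding.
text = "hello"
print(len(text.encode("utf-8")))      # 5
print(len(text.encode("utf-16-le")))  # 10
print(len(text.encode("utf-32-le")))  # 20

# 3. UTF-16 depends on byte order; UTF-8 has only one byte sequence.
print("A".encode("utf-16-le"))  # b'A\x00'
print("A".encode("utf-16-be"))  # b'\x00A'
print("A".encode("utf-8"))      # b'A'
```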

Recommended reading: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)