UTF-8 problem with PHP json_decode

EDIT2: The issue was with how my Perl client was interpreting the output from PHP's json_encode, which by default escapes non-ASCII characters as \uXXXX Unicode code point sequences. Putting the JSON Perl module in ascii mode (my $j = JSON->new()->ascii();) made things work as expected.


I'm interacting with an API written in PHP that returns JSON, using a client written in Perl which then submits a modified version of the JSON back to the same API. The API pulls values from a PostgreSQL database whose encoding is UTF8. What I'm running into is that the API returns the value in a different character encoding, even though the value PHP receives from the database is proper UTF-8.

I've managed to reproduce what I'm seeing with a couple of lines of PHP (5.3.24):

<?php
$val = array("Millán");
print json_encode($val)."\n";

According to the PHP documentation, string literals are encoded ... in whatever fashion [they are] encoded in the script file.

Here is the hex dumped file encoding (UTF-8 lower case a-acute = c3 a1):

$ grep ill test.php | od -An -t x1c
  24  76  61  6c  20  3d  20  61  72  72  61  79  28  22  4d  69
   $   v   a   l       =       a   r   r   a   y   (   "   M   i
  6c  6c  c3  a1  6e  22  29  3b  0a
   l   l 303 241   n   "   )   ;  \n

And here is the output from PHP:

$ php -f test.php | od -An -t x1c
  5b  22  4d  69  6c  6c  5c  75  30  30  65  31  6e  22  5d  0a
   [   "   M   i   l   l   \   u   0   0   e   1   n   "   ]  \n

The UTF-8 lower case a-acute has been changed to a "Unicode" lower case a-acute by json_encode.

How can I keep PHP/json_encode from switching the encoding of this variable?

EDIT: What's interesting is that if I change the string literal to utf8_encode("Millán") then things work as expected. The utf8_encode docs say that function only supports ISO-8859-1 input, so I'm a bit confused about why that works.

This is entirely based on a misunderstanding. json_encode encodes non-ASCII characters as Unicode escape sequences of the form \uXXXX. These sequences do not refer to the bytes of any particular UTF encoding; they refer to the character by its Unicode code point. U+00E1 is the Unicode code point for the character á. Any proper JSON parser will decode \u00e1 back into the character "á". There's no issue here.
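To make the round trip concrete, here is a minimal sketch (not taken from the original API) that encodes the string and decodes it again. The á is written as explicit bytes (\xc3\xa1) so the result does not depend on how the script file is saved; the variable names are only illustrative.

```php
<?php
// "Millán" with á given as its UTF-8 bytes c3 a1.
$original = "Mill\xc3\xa1n";

// By default json_encode escapes non-ASCII characters as \uXXXX.
$json = json_encode(array($original));
print $json . "\n";                  // ["Mill\u00e1n"]

// Decoding turns the escape back into the same UTF-8 bytes.
$decoded = json_decode($json, true);
print bin2hex($decoded[0]) . "\n";   // 4d696c6cc3a16e
var_dump($decoded[0] === $original); // bool(true)
```

The escaped form is plain ASCII on the wire, but the decoded value is byte-for-byte the original UTF-8 string, so a conforming client should see no difference.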

Try the code below to solve the problem.

<?php
$val = array("Millán");
print json_encode($val, JSON_UNESCAPED_UNICODE);

Note: adding the JSON_UNESCAPED_UNICODE flag to the json_encode call makes it emit the original UTF-8 characters instead of \uXXXX escapes. The flag requires PHP 5.4 or later.
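Since the question mentions PHP 5.3.24, where this flag does not exist, a guarded variant might look like the sketch below. It is only an illustration: it falls back to the default escaped output on older versions, which decodes to the same string anyway.

```php
<?php
$val = array("Mill\xc3\xa1n"); // á as UTF-8 bytes c3 a1

if (defined('JSON_UNESCAPED_UNICODE')) {
    // PHP 5.4+: keep the raw UTF-8 bytes in the JSON text.
    $json = json_encode($val, JSON_UNESCAPED_UNICODE);
} else {
    // Older PHP: only the \uXXXX form is available.
    $json = json_encode($val);
}

// Hex dump of the JSON text, comparable to the od output in the question.
print bin2hex($json) . "\n";
```

Either branch produces valid JSON; the only difference is whether á travels as the two bytes c3 a1 or as the six ASCII characters \u00e1.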

For Python, see "Saving utf-8 texts in json.dumps as UTF8, not as \u escape sequence".