I am parsing an HTML page. At some point I get the text inside a div and use html_entity_decode to print that text.
The problem is that the page contains characters like this star ★ or shapes like ⬛︎, ◄, ◉, etc. I have checked, and these characters are not encoded as entities in the page source; they appear there literally, just as you see them here.
The page is using charset="UTF-8"
So, when I use
html_entity_decode($string, ENT_QUOTES, 'UTF-8');
The star, for example, is "decoded" to â˜
$string is being obtained by using
document.getElementById("id-of-div").innerText
I would like to decode them correctly. How do I do that in PHP?
NOTE: I have tried htmlspecialchars_decode($string, ENT_QUOTES);
and it produces the same result.
I've tried to reproduce your issue with this simple bit of PHP:
<?php
// Make sure our client knows we're sending UTF-8
header('Content-Type: text/plain; charset=utf-8');
$string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: &lt;span&gt;This is a &quot;test&quot;.&lt;/span&gt;";
echo 'String: ' . $string . "\n";
echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');
As expected, the output is:
String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: &lt;span&gt;This is a &quot;test&quot;.&lt;/span&gt;
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: <span>This is a "test".</span>
If I change the charset in the header to iso-8859-1, I see this:
String: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: &lt;span&gt;This is a &quot;test&quot;.&lt;/span&gt;
Decoded: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: <span>This is a "test".</span>
So, I'd say that your issue is a display issue. The "interesting" characters are being left completely untouched by html_entity_decode, as you'd expect. It's just that whatever code you've got, or whatever you're using to look at your output, is incorrectly using iso-8859-1 to display them.
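If you want to confirm this on your side, a minimal sketch along the following lines may help. It assumes PHP 7+ (for the "\u{...}" escape) and the mbstring extension; the variable names are only illustrative. It checks that the star's bytes survive html_entity_decode unchanged, then simulates a viewer that reads those same bytes as Windows-1252, which is how browsers treat a page labelled iso-8859-1.

```php
<?php
// Sketch only: assumes PHP 7+ and the mbstring extension are available.
$star = "\u{2605}"; // the star, stored as the three UTF-8 bytes e2 98 85

// Decoding only rewrites the entities; the star's bytes pass through unchanged.
$decoded = html_entity_decode($star . ' &quot;test&quot;', ENT_QUOTES, 'UTF-8');
echo bin2hex($star), "\n";                  // e29885
echo bin2hex(substr($decoded, 0, 3)), "\n"; // e29885

// Simulate a viewer that wrongly interprets those bytes as Windows-1252
// and then shows the result: each byte becomes its own character.
$misread = mb_convert_encoding($star, 'UTF-8', 'Windows-1252');
echo $misread, "\n";                        // â˜…
```

If the hex dump shows e29885 both before and after decoding, the string itself is fine, and the fix belongs on the output side: declare UTF-8 in the Content-Type header or the page's meta charset, or make sure your terminal or editor is reading the output as UTF-8.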