PHP - html_entity_decode does not decode everything

I am parsing an HTML page. At some point I get the text inside a div and use html_entity_decode to print that text.

The problem is that the page contains characters like this star ★ and other shapes such as ⬛︎, ◄, ◉, etc. I have checked, and these characters are not encoded as entities in the source page; they appear there literally, just as you see them here.

The page is using charset="UTF-8"

So, when I use

html_entity_decode($string, ENT_QUOTES, 'UTF-8');

The star, for example, is "decoded" to â˜

$string is being obtained by using

document.getElementById("id-of-div").innerText

I would like to decode them correctly. How do I do that in PHP?

NOTE: I have tried htmlspecialchars_decode($string, ENT_QUOTES); and it produces the same result.

I've tried to reproduce your issue with this simple bit of PHP:

<?php
  // Make sure our client knows we're sending UTF-8
  header('Content-Type: text/plain; charset=utf-8');
  $string = "The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.";
  echo 'String: ' . $string . "\n";
  echo 'Decoded: ' . html_entity_decode($string, ENT_QUOTES, 'UTF-8');

As expected, the output is:

String: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star ★ or others like shapes like ⬛︎, ◄, ◉, etc. Here are some entities: This is a "test".
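
To rule out the display layer entirely, you can also compare the raw bytes before and after decoding. This is a minimal sketch with a shortened sample string of my own (not your real data); the star is written as escaped UTF-8 bytes so the encoding of the PHP file itself doesn't matter:

```php
<?php
// Shortened sample (an assumption, not the asker's real data):
// a literal UTF-8 star followed by text containing a real entity.
$string  = "\xE2\x98\x85 &quot;test&quot;";
$decoded = html_entity_decode($string, ENT_QUOTES, 'UTF-8');

// Hex dump of the first three bytes (the star) before and after decoding.
$before = bin2hex(substr($string, 0, 3));
$after  = bin2hex(substr($decoded, 0, 3));

echo $before, "\n";
echo $after, "\n";
```

Both dumps should read e29885, the UTF-8 encoding of ★, which shows the decode step only touched &quot; and left the star alone.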

If I change the charset in the header to iso-8859-1, I see this:

String: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: This is a &quot;test&quot;.
Decoded: The page contains characters like this star â˜… or others like shapes like â¬›ï¸Ž, â—„, â—‰, etc. Here are some entities: This is a "test".

So, I'd say that your issue is a display issue. The "interesting" characters are being left completely untouched by html_entity_decode, as you'd expect. It's just that whatever code you've got, or whatever you're using to look at your output, is incorrectly using iso-8859-1 to display them.
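
If you want to see where the garbage comes from, you can reproduce it deliberately. This is only a sketch: it assumes the mbstring extension is available, and it uses Windows-1252 as the source encoding because that is what browsers actually apply when a page is labelled iso-8859-1. The variable names are mine.

```php
<?php
// Requires the mbstring extension (an assumption about your setup).
// UTF-8 bytes of the star, escaped so the file encoding doesn't matter.
$star = "\xE2\x98\x85";

// Pretend those three bytes are three Windows-1252 characters and
// re-encode them as UTF-8: this is the misreading a viewer performs.
$misread = mb_convert_encoding($star, 'UTF-8', 'Windows-1252');
echo $misread, "\n";

// Reversing the conversion recovers the original bytes, so nothing was lost;
// the data was only ever displayed with the wrong charset.
$restored = mb_convert_encoding($misread, 'Windows-1252', 'UTF-8');
echo bin2hex($restored), "\n";
```

The first echo prints the same kind of garbage as in the question, and the second prints e29885 again. In practice the fix is not to convert anything, but to make sure whatever displays the output is told it is UTF-8, for example with the Content-Type header from my test script.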