PHP lacks an mb_ord() function, that is, something that does what the ord() function does, but for UTF-8 (or "mb" for multibyte, hence "mb_ord"). I used some clues from here,
$ord = hexdec( bin2hex($utf8char) ); //decimal
and I suppose that mb_substr($text, $i, 1, 'UTF-8') gets one UTF-8 character. But $ord does not hold the values that we expect.
This code does not work: it does not show codes like 177 (plusmn).
$msg = '';
$text = "... a UTF-8 long text... Ą ⨌ 2.5±0.1; 0.5±0.2 ...";
$allOrds = array();
for ($i = 0; $i < mb_strlen($text, 'UTF-8'); $i++) {
    $utf8char = mb_substr($text, $i, 1, 'UTF-8'); // 1=1 unicode character?
    $ord = hexdec( bin2hex($utf8char) ); //decimal
    if ($ord > 126) { //non-ASCII
        if (isset($allOrds[$ord])) $allOrds[$ord]++; else $allOrds[$ord] = 1;
    }
}
foreach ($allOrds as $o => $n)
    $msg .= "\nentity #$o occurs $n times";
echo $msg;
OUTPUT
entity #50308 occurs 1 times
entity #14854284 occurs 1 times
entity #49841 occurs 2 times
So (see the entities table) the values are wrong: ± (plusmn) should give 177, not 49841, and ⨌ (iiiint) should give 10764, not 14854284.
"something that does what the ord() function does, but for UTF-8"
For that you'd first need to define what exactly that is. ord gives you the numerical value of a byte. This is often mistaken for the "value of the character", but because a character may be encoded as several bytes, and differently in each encoding, that makes no sense. So, ord == numerical value of a byte. What exactly would you expect the "MB version of ord" to do, then?
Anyway, what you're getting is the numeric value of two (or more) bytes read as a single number. Say, the character "漢" in UTF-8 is encoded as the three bytes E6 BC A2. That's what bin2hex gives you. hexdec then translates that to decimal, which is a pretty large number (15121570). That number has nothing to do with the Unicode code point U+6F22, which is what you're really after. That is because UTF-8 adds marker bits to every byte and spreads the code point's bits across them, hence U+6F22 (漢) does not translate into the bytes 6F 22.
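To make the difference concrete, here is a minimal sketch that decodes the UTF-8 bytes by hand. The function name utf8_code_point_sketch is made up for this example (it is not a PHP built-in), and the code assumes the input starts with one well-formed UTF-8 character; it does no validation beyond the lead byte.

```php
<?php
// Hypothetical helper (not a PHP built-in): decode the first character of a
// UTF-8 string into its Unicode code point. Assumes well-formed, non-empty input.
function utf8_code_point_sketch(string $char): int
{
    $lead = ord($char[0]);
    if ($lead < 0x80) {                  // 0xxxxxxx: plain ASCII, one byte
        return $lead;
    }
    if (($lead & 0xE0) === 0xC0) {       // 110xxxxx: two-byte sequence
        $extra = 1;
        $codePoint = $lead & 0x1F;
    } elseif (($lead & 0xF0) === 0xE0) { // 1110xxxx: three-byte sequence
        $extra = 2;
        $codePoint = $lead & 0x0F;
    } elseif (($lead & 0xF8) === 0xF0) { // 11110xxx: four-byte sequence
        $extra = 3;
        $codePoint = $lead & 0x07;
    } else {
        throw new InvalidArgumentException('Not a valid UTF-8 lead byte');
    }
    for ($i = 1; $i <= $extra; $i++) {
        // Each continuation byte is 10xxxxxx and contributes six payload bits.
        $codePoint = ($codePoint << 6) | (ord($char[$i]) & 0x3F);
    }
    return $codePoint;
}

$han = "\xE6\xBC\xA2"; // "漢" written out as its three UTF-8 bytes

echo ord($han), "\n";                             // 230 (0xE6): ord() reads only the first byte
echo hexdec(bin2hex($han)), "\n";                 // 15121570 (0xE6BCA2): all bytes as one number
printf("U+%04X\n", utf8_code_point_sketch($han)); // U+6F22: the actual code point
```

The three outputs are three different things: the value of one byte, the value of all bytes glued together, and the code point recovered by stripping the marker bits.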
You have already linked to another question which does what you want:
list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
This essentially uses the same logic, but bases it on the UCS-4 encoding, in which every code point is stored directly as a 4-byte integer, so the bytes correspond to the code point.
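As a rough sketch of how that one-liner could be dropped into the counting loop from the question: the wrapper name utf8_ord_sketch is hypothetical, the code needs the mbstring extension, and it assumes each call receives exactly one well-formed UTF-8 character. The sample text is a shortened version of the question's string, written with byte escapes so that it does not depend on the encoding of the source file.

```php
<?php
// Hypothetical wrapper (the name is made up for this example): returns the
// Unicode code point of a single UTF-8 character. Requires mbstring.
function utf8_ord_sketch(string $utf8Character): int
{
    // UCS-4BE stores each code point as one 4-byte big-endian integer,
    // which unpack('N', ...) reads back as an unsigned 32-bit number.
    list(, $ord) = unpack('N', mb_convert_encoding($utf8Character, 'UCS-4BE', 'UTF-8'));
    return $ord;
}

// "Ą ⨌ 2.5±0.1; 0.5±0.2" spelled as UTF-8 byte escapes
$text = "\xC4\x84 \xE2\xA8\x8C 2.5\xC2\xB10.1; 0.5\xC2\xB10.2";

$allOrds = [];
$length = mb_strlen($text, 'UTF-8');
for ($i = 0; $i < $length; $i++) {
    $ord = utf8_ord_sketch(mb_substr($text, $i, 1, 'UTF-8'));
    if ($ord > 126) { // non-ASCII, same threshold as in the question
        $allOrds[$ord] = ($allOrds[$ord] ?? 0) + 1;
    }
}

foreach ($allOrds as $o => $n) {
    echo "entity #$o occurs $n times\n";
}
// entity #260 occurs 1 times
// entity #10764 occurs 1 times
// entity #177 occurs 2 times
```

On PHP 7.2 and later the mbstring extension ships a built-in mb_ord(), so a wrapper like this should only be needed on older versions.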