Is PHP's trim multibyte safe?

I know that there is no mb_trim version of trim. I have links to a dozen articles on how to implement one using preg_replace.

My question is: is the usual trim with its default characters multibyte safe? That is, is there any example of a multibyte character that ends with the byte value of a single-byte whitespace character?

It depends on the encoding you're talking about. Both UTF-16LE and UTF-32LE have tons of characters ending in null bytes, for example, which trim removes by default.

The string "a" in UTF-16LE consists of the bytes 0x61 0x00, and trim will remove the null byte leaving just 0x61.

Note that this problem goes the other way too: trim strips bytes from the beginning of strings as well as the end. If your string "a" is in UTF-16BE it will be encoded as 0x00 0x61, and trim will again leave you with just 0x61.


Example:

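// Encode "a" in both byte orders; each form contains one null byte.
$utf16le = iconv("ASCII", "UTF-16LE", "a");
$utf16be = iconv("ASCII", "UTF-16BE", "a");

var_dump(
  bin2hex($utf16le),        // original little-endian bytes
  bin2hex(trim($utf16le)),  // trailing 0x00 stripped
  bin2hex($utf16be),        // original big-endian bytes
  bin2hex(trim($utf16be))   // leading 0x00 stripped
);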

Output:

string(4) "6100"
string(2) "61"
string(4) "0061"
string(2) "61"
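
If you do have to trim a UTF-16 or UTF-32 string, one possible workaround is to convert it to UTF-8, trim there, and convert back. The following is only a sketch under the assumption that you know the input encoding; the helper name trim_in_encoding is made up for illustration and is not a PHP function.

```php
<?php
// Hypothetical helper: trim a string stored in an encoding that is not
// ASCII compatible by round-tripping through UTF-8.
function trim_in_encoding(string $s, string $encoding): string
{
    $utf8 = iconv($encoding, 'UTF-8', $s);
    if ($utf8 === false) {
        throw new InvalidArgumentException("Input is not valid $encoding");
    }
    // trim() is byte-safe on UTF-8, so do the work there and convert back.
    return iconv('UTF-8', $encoding, trim($utf8));
}

// " a " in UTF-16LE is the byte sequence 20 00 61 00 20 00.
$padded = iconv('ASCII', 'UTF-16LE', ' a ');

var_dump(bin2hex(trim_in_encoding($padded, 'UTF-16LE'))); // string(4) "6100"
```

Here both spaces are removed and the null byte that belongs to "a" is kept.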

If you're only worried about UTF-8 then no, there aren't any conflicts. UTF-8 is ASCII compatible: every single-byte character has the form 0xxx xxxx, while every byte of a multibyte character has its most significant bit set (1xxx xxxx), so there is no ambiguity. With UTF-8, trim using its default character mask is safe.
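
The bit-pattern argument can be checked by hand. This is a small sketch, assuming the source file is saved as UTF-8; has_ascii_byte is a hypothetical helper, not a built-in.

```php
<?php
// Hypothetical helper: true if any byte of $s is in the ASCII range
// (high bit clear, i.e. below 0x80).
function has_ascii_byte(string $s): bool
{
    foreach (str_split($s) as $byte) {
        if (ord($byte) < 0x80) {
            return true;
        }
    }
    return false;
}

// No byte of a multibyte UTF-8 character falls in the ASCII range, so none
// can collide with trim()'s default mask " \t\n\r\0\x0B".
var_dump(has_ascii_byte('漢'));   // bool(false)
var_dump(has_ascii_byte('漢 '));  // bool(true), because of the space

// The default mask therefore removes only the real whitespace.
var_dump(trim("  漢字は \n") === '漢字は'); // bool(true)
```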

If you're concerned about other encodings then it's going to depend on what they are. If you use multibyte characters as part of trim's character mask you can run into problems, because each byte of the mask is treated individually.

Since the characters in the default character mask (" \t\n\r\0\x0B") are all ASCII, it is safe to use trim() on a multibyte string in UTF-8.

trim('  漢字は  '); // ok

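A character mask containing multibyte characters will cause problems, because trim() matches the mask byte by byte rather than character by character.

trim('はは漢字はは', 'は'); // unreliable: the mask is the bytes E3 81 AF, not the character は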
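
To make the failure concrete: in UTF-8, は is E3 81 AF and あ is E3 81 82, so a mask of は also strips the first two bytes of a neighbouring あ. The sketch below shows that, followed by one possible Unicode-aware alternative built on preg_replace with the u modifier. It assumes the file is saved as UTF-8, and utf8_trim_chars is a hypothetical name chosen for illustration.

```php
<?php
// Byte-wise trimming eats E3 81 from あ and leaves a lone 0x82 byte.
var_dump(bin2hex(trim('はあ', 'は'))); // string(2) "82"

// Hypothetical helper: strip whole characters from both ends using a
// Unicode-aware regular expression instead of a byte mask.
function utf8_trim_chars(string $s, string $chars): string
{
    if ($chars === '') {
        return $s; // an empty character class would be an invalid pattern
    }
    $class = preg_quote($chars, '/');
    $pattern = '/\A[' . $class . ']+|[' . $class . ']+\z/u';
    // preg_replace() returns null on error (e.g. invalid UTF-8 input);
    // fall back to the untouched string in that case.
    return preg_replace($pattern, '', $s) ?? $s;
}

var_dump(utf8_trim_chars('はあ', 'は'));         // string(3) "あ"
var_dump(utf8_trim_chars('はは漢字はは', 'は')); // string(6) "漢字"
```

The \A and \z anchors are used instead of ^ and $ so that a trailing newline is not treated as the end of the string.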