I know that there is no mb_trim version of trim. I have links to a dozen articles on how to implement one using preg_replace.
My question is: is the usual trim with its default characters multibyte-safe? That is, is there any multibyte character whose encoding ends with the byte value of a single-byte whitespace character?
It depends on the encoding you're talking about. Both UTF-16LE and UTF-32LE, for example, have tons of characters ending in null bytes, which trim removes by default.
The string "a" in UTF-16LE consists of the bytes 0x61 0x00, and trim will remove the null byte, leaving just 0x61.
Note that this problem goes the other way too: trim strips bytes from the beginning of strings as well as the end. If your string "a" is in UTF-16BE it will be encoded as 0x00 0x61, with trim again leaving you with just 0x61.
Example:
$utf16le = iconv("ASCII", "UTF-16LE", "a");
$utf16be = iconv("ASCII", "UTF-16BE", "a");
var_dump(
bin2hex($utf16le),
bin2hex(trim($utf16le)),
bin2hex($utf16be),
bin2hex(trim($utf16be))
);
Output:
string(4) "6100"
string(2) "61"
string(4) "0061"
string(2) "61"
If you're only worried about UTF-8 then no, there aren't any conflicts. It is ASCII-compatible: all single-byte characters in UTF-8 have the form 0xxx xxxx, while every byte of a multibyte character has its most significant bit set (1xxx xxxx), so there is no ambiguity. With UTF-8, trim using its default character mask is safe.
If you're concerned about other encodings then it's going to depend on what they are. If you try using multibyte characters as part of trim's character mask you'll definitely run into problems, as each byte of the mask is treated individually.
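For an encoding that is not ASCII-compatible, one possible workaround is to convert to UTF-8, trim there, and convert back. The sketch below assumes iconv supports the encoding in question; trim_in_encoding is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper: trim a string held in an arbitrary encoding
// by round-tripping through UTF-8, where the default mask is safe.
function trim_in_encoding(string $s, string $encoding): string
{
    $utf8 = iconv($encoding, 'UTF-8', $s);
    if ($utf8 === false) {
        return $s; // conversion failed; return the input untouched
    }
    $back = iconv('UTF-8', $encoding, trim($utf8));
    return $back === false ? $s : $back;
}

// " a " in UTF-16LE is 2000 6100 2000; only the spaces should go.
$utf16le = iconv('UTF-8', 'UTF-16LE', ' a ');
var_dump(bin2hex(trim_in_encoding($utf16le, 'UTF-16LE'))); // string(4) "6100"
```

Unlike a plain trim on the UTF-16LE bytes, this keeps the trailing 0x00 that belongs to "a".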
Since the characters in the default character mask (space, \t, \n, \r, \0, \x0B) are all ASCII, it is safe to use trim() with a multibyte string in an ASCII-compatible encoding such as UTF-8.
trim(' 漢字は '); // ok
A character mask containing multibyte characters will cause problems, because trim() treats the mask as a set of individual bytes.
trim('はは漢字はは', 'は'); // bad
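To make the failure concrete: in UTF-8, は is the bytes E3 81 AF and あ is E3 81 82, so a mask of は can eat the first two bytes of an adjacent あ. The sketch below shows that, followed by one possible character-aware alternative built on preg_replace with the /u modifier. It assumes UTF-8 input and a UTF-8 source file; utf8_trim_chars is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Assumes this source file is saved as UTF-8.

// Byte-wise trim: the mask 'は' is really the three bytes E3 81 AF.
// 'あ' is E3 81 82, so trim() strips its first two bytes and leaves 0x82.
$broken = trim('あは', 'は');
var_dump(bin2hex($broken)); // string(2) "82"

// Hypothetical helper: trims whole UTF-8 characters listed in $chars.
function utf8_trim_chars(string $s, string $chars): string
{
    if ($chars === '') {
        return $s; // an empty character class would be an invalid pattern
    }
    // preg_quote escapes characters that are special inside a class;
    // the /u modifier makes PCRE match code points instead of bytes.
    $class = '[' . preg_quote($chars, '/') . ']+';
    $result = preg_replace('/\A' . $class . '|' . $class . '\z/u', '', $s);
    return $result ?? $s; // preg_replace returns null on invalid UTF-8
}

var_dump(utf8_trim_chars('はあ漢字はは', 'は')); // string(9) "あ漢字"
```

The pattern uses \A and \z rather than ^ and $ so that a trailing newline is not silently skipped over.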