I'm trying to use strip_tags and trim to detect whether a string contains only empty HTML.
$description = '<p>&nbsp;</p>';
$output = trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
var_dump($output);
string 'Â ' (length=2)
My debugging attempt to figure this out:
$description = '<p>&nbsp;</p>';
$test = mb_detect_encoding($description);
$test .= "\n";
$test .= trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
$test .= "\n";
$test .= html_entity_decode($description, ENT_QUOTES, 'UTF-8');
file_put_contents('debug.txt', $test);
Output (debug.txt):
ASCII
<p> </p>
If you use var_dump(urlencode($output)) you'll see that it outputs string(6) "%C2%A0", so the string consists of the two bytes 0xC2 and 0xA0. Together these bytes are the UTF-8 encoding of U+00A0, the non-breaking space, which is what html_entity_decode produces from &nbsp;. Make sure your file is saved as UTF-8 and that your HTTP headers declare UTF-8.
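To check the byte claim directly, here is a minimal sketch. It assumes the input is the pure-ASCII string '<p>&nbsp;</p>' and uses bin2hex, which does not appear in the original post, to print the raw bytes left after decoding:

```php
<?php
// Assumption: the entity in the source string is &nbsp; (pure ASCII input).
$description = '<p>&nbsp;</p>';

// Decode entities to UTF-8, then drop the tags.
$decoded = strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8'));

// bin2hex shows the raw bytes: c2a0 is the UTF-8 encoding of U+00A0.
var_dump(bin2hex($decoded)); // string(4) "c2a0"

// trim() strips only a fixed set of ASCII characters by default,
// so both bytes survive.
var_dump(strlen(trim($decoded))); // int(2)
```

If the hex dump shows anything other than c2a0, the input probably contains a different entity or a different encoding than assumed here.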
That said, trim only strips a fixed set of ASCII whitespace characters by default, so to remove this character you can use a regex with the unicode modifier instead:
DEMO:
<?php
$description = '<p>&nbsp;</p>';
$output = trim(strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
var_dump(urlencode($output)); // string(6) "%C2%A0"
// -------
$output = preg_replace('~^\s+|\s+$~', '', strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
var_dump(urlencode($output)); // string(6) "%C2%A0"
// -------
$output = preg_replace('~^\s+|\s+$~u', '', strip_tags(html_entity_decode($description, ENT_QUOTES, 'UTF-8')));
// Unicode! -----------------------^
var_dump(urlencode($output)); // string(0) ""
Regex autopsy:
~ - the regex delimiter; it must appear before the pattern and again after it, before the modifiers
^\s+ - the start of the string immediately followed by one or more whitespace characters, i.e. leading whitespace (^ means start of the string, \s means a whitespace character, + means "matched one or more times")
| - OR
\s+$ - one or more whitespace characters immediately followed by the end of the string, i.e. trailing whitespace
~ - the closing regex delimiter
u - the regex modifier; here the unicode modifier (PCRE_UTF8), which treats the pattern and subject as UTF-8 so that \s also matches Unicode whitespace such as the non-breaking space
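Putting the pieces together, the check from the question could be wrapped in a small helper. This is only a sketch: the function name is_empty_html is hypothetical (it is not from the original post), and the scalar type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper, not part of the original post.
function is_empty_html(string $html): bool
{
    // Decode entities to UTF-8 first, then remove the tags.
    $text = strip_tags(html_entity_decode($html, ENT_QUOTES, 'UTF-8'));

    // With the u modifier, \s also matches Unicode whitespace such as U+00A0.
    $trimmed = preg_replace('~^\s+|\s+$~u', '', $text);

    // preg_replace returns null on failure (for example invalid UTF-8);
    // the strict comparison treats that case as "not empty".
    return $trimmed === '';
}

var_dump(is_empty_html('<p>&nbsp;</p>'));     // bool(true)
var_dump(is_empty_html('<p>&nbsp;text</p>')); // bool(false)
```

Decoding before stripping tags follows the order used in the question; whether that order is right for untrusted input depends on what the result is used for.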