I have a MySQL query that returns data to be formatted into an XML file. One of the columns is a free-text field that can contain strange characters that "break" the XML with an encoding error. I believe these characters are curly quotes that made it into a record when the user pasted text from Microsoft Word while originally entering it. I do not have control over that process.
Example of the strange characters:
“TURN KEY – Totally Furnishedâ€
I am using htmlspecialchars to "clean" this data, but it blanks the field entirely in the XML record. That avoids the encoding error, but the record is now missing its data for that field. I still want that data; I just want to omit the weird characters or change them to something like a dash.
$description = htmlspecialchars($row['PropertyInformation'], ENT_QUOTES, 'UTF-8');
The XML output ends up like this in the records where the weird characters are occurring:
<DESCRIPTIF>
<![CDATA[ ]]>
</DESCRIPTIF>
The htmlspecialchars function returns an empty string if the input string contains an invalid code unit sequence within the given encoding, unless either the ENT_IGNORE or ENT_SUBSTITUTE flag is set.
The ENT_IGNORE flag silently discards invalid code unit sequences instead of returning an empty string. Using this flag is discouraged as it may have security implications.
The ENT_SUBSTITUTE flag replaces invalid code unit sequences with a Unicode Replacement Character U+FFFD (UTF-8) or &#xFFFD; (otherwise) instead of returning an empty string.
You could try setting one of these flags:
htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE);
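Since the question asks for a dash in place of the bad characters, here is a minimal sketch that builds on ENT_SUBSTITUTE. The helper name escape_with_dash is made up for illustration, and the "\u{FFFD}" escape assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not part of PHP): escape text for XML output and
// turn every invalid UTF-8 sequence into a dash instead of blanking the field.
function escape_with_dash(string $text): string
{
    // ENT_SUBSTITUTE replaces each invalid sequence with U+FFFD
    // rather than making htmlspecialchars return an empty string.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // "\u{FFFD}" is the replacement character encoded as UTF-8 (PHP 7+ syntax).
    return str_replace("\u{FFFD}", '-', $escaped);
}

// "\x93" and "\x94" are Windows-1252 curly quotes; as lone bytes they are invalid UTF-8.
echo escape_with_dash("\x93TURN KEY\x94");
```

Note that any genuine U+FFFD already present in the data would also become a dash, which is probably acceptable here.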
Looks like you forgot to capitalize UTF-8:
$description = htmlspecialchars($row['PropertyInformation'], ENT_QUOTES, 'UTF-8');
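/**
 * Strip ASCII control characters and every non-ASCII byte from a string.
 * Note: this also removes tabs, newlines, and valid multibyte UTF-8 characters.
 *
 * @param string $string
 * @return string
 */
function str_clean($string)
{
    return preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $string);
}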
$string = '“TURN KEY – Totally Furnishedâ€';
echo htmlspecialchars(str_clean($string), ENT_QUOTES, 'UTF-8');
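The byte-range regex in str_clean also strips legitimate accented characters, which may matter for a free-text description. As an alternative, here is a rough sketch that tries to repair the encoding instead of deleting bytes. It assumes the mbstring extension is loaded and that any text which is not valid UTF-8 is really Windows-1252 (a guess based on the Word paste, not something confirmed in the question). The name to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: return $text as UTF-8, converting from Windows-1252
// only when the input is not already valid UTF-8. Requires ext-mbstring.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid, leave untouched
    }
    // Assumption: invalid input came from a Windows-1252 source such as Word.
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
}

// "\xE9" is "é" in Windows-1252 and is invalid as a lone UTF-8 byte.
echo htmlspecialchars(to_utf8("caf\xE9"), ENT_QUOTES, 'UTF-8');
```

This will not undo text that was already double-encoded before it was stored (the "â€" style sequences in the sample); that usually needs fixing at the database connection or import step.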