XML, unlike HTML, only knows five named entities: &lt;, &gt;, &amp;, &apos; and &quot;.
I have been using XMLWriter in PHP to write lots of data to an XML file, and first I escape the desired text, which gives me some other entities, such as &Acirc; and &curren;.
I have tried the following regex:
&(?!(apos|quot|[gl]t|amp);)
but it only matches the & and not the whole &Acirc; or &curren;. What am I doing wrong?
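If you append \w+; to your expression, it will work. The negative lookahead is zero-width, so on its own the pattern consumes only the &; adding \w+; makes it consume the entity name and the trailing semicolon as well: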
&(?!(?:apos|quot|[gl]t|amp);)\w+;
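To see what the pattern picks up, here is a minimal sketch. The helper name findNonXmlEntities() and the sample string are made up for illustration and are not part of the question. Note that numeric references such as &#233; are not matched, because # is not a word character; that is fine, since they are valid in XML anyway.

```php
<?php
// Hypothetical helper: returns every named entity that XML does not predefine.
function findNonXmlEntities(string $text): array
{
    // The lookahead rejects the five XML entities; \w+; consumes the name and the semicolon.
    $pattern = '/&(?!(?:apos|quot|[gl]t|amp);)\w+;/';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

// Sample text mixing HTML-only entities with XML-safe ones.
$sample = 'Caf&eacute; &amp; bar: 5 &curren; &lt; 10 &curren;';

// Prints: &eacute;, &curren;, &curren;
echo implode(', ', findNonXmlEntities($sample)), "\n";
```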
But you are better off using the correct escaping function from the beginning, so that these entities never appear in the first place.
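As a rough illustration of the difference, assuming the text was escaped with htmlentities() (the question does not say which function was used): htmlentities() turns every character that has an HTML name into a named entity, while htmlspecialchars() with the ENT_XML1 flag only touches the characters that are special in XML. The sample string is made up. This only matters if you build XML strings by hand; text passed to XMLWriter::text() should not be pre-escaped at all.

```php
<?php
// Hypothetical sample: U+00C2 (A with circumflex), U+00A4 (currency sign), plus markup characters.
$raw = "\u{00C2} \u{00A4} & <b>";

// HTML escaping: produces named entities that an XML parser does not know.
$html = htmlentities($raw, ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo $html, "\n"; // &Acirc; &curren; &amp; &lt;b&gt;

// XML-oriented escaping: non-ASCII characters stay as plain UTF-8.
$xml = htmlspecialchars($raw, ENT_QUOTES | ENT_XML1, 'UTF-8');
echo $xml, "\n";
```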
Could you not just use strip_tags() (with a list of allowed tags) instead of htmlentities()?
Do not escape the entities yourself. Let the XMLWriter do the needed escaping.
$writer = new XMLWriter();
$writer->openMemory();                   // build the document in memory
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('root');
$writer->text('A & B & <C>');            // pass raw text; XMLWriter escapes it
$writer->endElement();
$writer->endDocument();
echo $writer->outputMemory(true);
Output:
<?xml version="1.0" encoding="UTF-8"?>
<root>A &amp; B &amp; &lt;C&gt;</root>
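If the text has already been entity-escaped and you cannot change that step, one possible approach is to decode it back to plain UTF-8 with html_entity_decode() and then hand the result to XMLWriter. This is only a sketch; the input string is a made-up example, and it assumes the entities really were meant as escaped text rather than as literal content.

```php
<?php
// Hypothetical input: text that was already run through an HTML escaping function.
$escaped = '&Acirc; &curren; &amp; &lt;b&gt;';

// Turn every entity back into the character it stands for (UTF-8 result).
$plain = html_entity_decode($escaped, ENT_QUOTES | ENT_HTML5, 'UTF-8');

$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument('1.0', 'UTF-8');
$writer->startElement('root');
$writer->text($plain);   // XMLWriter applies the XML escaping itself
$writer->endElement();
$writer->endDocument();

$xml = $writer->outputMemory(true);
echo $xml;
```

The resulting document contains no HTML-only entities, so there is nothing left for a regex to clean up.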