I'm trying to make a web service in PHP for an app to communicate with that will get data from a database and put it into XML format for the app. One of the columns, however, contains HTML and needs to be output (I think) as CDATA. I'm having trouble accomplishing this, though. Please advise.
<?php
mysql_connect(DB_HOST, DB_USER, DB_PASSWORD);
mysql_select_db(DB_NAME);
$sql = "SELECT post_date_gmt, post_content, post_title FROM [schema].wp_posts WHERE post_status = \"publish\" && post_type = \"post\" ORDER BY post_date_gmt DESC;";
$res = mysql_query($sql);
$xml = new XMLWriter();
$xml->openURI("php://output");
$xml->startDocument();
$xml->setIndent(true);
$xml->startElement('BlogPosts');
while ($row = mysql_fetch_assoc($res)) {
$xml->startElement("Post");
$xml->startElement("PostDate");
$xml->writeRaw($row['post_date_gmt']);
$xml->endElement();
$xml->startElement("PostTitle");
$xml->$writeRaw($row['post_title']);
$xml->endElement();
$xml->startCData("PostContent");
$xml->writeCData($row['post_content']);
$xml->endCData();
$xml->endElement();
}
$xml->endElement();
header('Content-type: text/xml');
$xml->flush();
?>
Thank you very much in advance for any assistance you could offer!
Do not use XMLWriter::writeRaw(), except if you really want to write XML fragments directly. "Raw" means that there will be no escaping by the library.
The correct way to write text into the XML document is XMLWriter::text().
$xml->startElement('PostTitle');
$xml->text('foo & bar');
$xml->endElement();
Output:
<?xml version="1.0"?>
<PostTitle>foo &amp; bar</PostTitle>
If you used XMLWriter::writeRaw() in this example, the result would contain an unescaped & and be invalid XML.
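To make the difference concrete, here is a minimal sketch that writes the same string once with text() and once with writeRaw(), then checks whether a parser accepts each result. The helper name render_title() is hypothetical and only exists for this illustration; the writer uses openMemory() so the output can be inspected as a string.

```php
<?php
// Hypothetical helper (not part of XMLWriter): render a single element
// in memory, using either text() (escaped) or writeRaw() (unescaped).
function render_title(string $value, bool $raw): string
{
    $xml = new XMLWriter();
    $xml->openMemory();
    $xml->startElement('PostTitle');
    if ($raw) {
        $xml->writeRaw($value);   // written as-is, no escaping
    } else {
        $xml->text($value);       // special characters are escaped
    }
    $xml->endElement();
    return $xml->outputMemory();
}

$escaped = render_title('foo & bar', false);
$raw     = render_title('foo & bar', true);

echo $escaped, "\n"; // <PostTitle>foo &amp; bar</PostTitle>
echo $raw, "\n";     // <PostTitle>foo & bar</PostTitle>

// Let a parser decide whether each fragment is well-formed.
libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
$dom = new DOMDocument();
$escapedIsWellFormed = $dom->loadXML($escaped);
$rawIsWellFormed     = $dom->loadXML($raw);

var_dump($escapedIsWellFormed); // bool(true)
var_dump($rawIsWellFormed);     // bool(false)
```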
CDATA sections are character nodes not unlike text nodes, but they allow special characters without escaping and preserve whitespace. You always have to create the element node separately. An element node can contain multiple other nodes, even multiple CDATA sections.
XMLWriter has two ways to create CDATA sections:
A single method:
$xml->startElement("PostContent");
$xml->writeCData('<b>post</b> content');
$xml->endElement();
Output:
<?xml version="1.0"?>
<PostContent><![CDATA[<b>post</b> content]]></PostContent>
Or start/end methods:
$xml->startElement("PostContent");
$xml->startCData();
$xml->text('<b>post</b> content');
$xml->text(' more content');
$xml->endCData();
$xml->endElement();
Output:
<?xml version="1.0"?>
<PostContent><![CDATA[<b>post</b> content more content]]></PostContent>
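Applied to the loop from the question, this could look roughly like the sketch below. It is an illustration under assumptions, not a drop-in replacement: the $rows array is made-up sample data standing in for the database result, the writer uses openMemory() instead of php://output so the result can be inspected, and it does not guard against content that contains "]]>".

```php
<?php
// Assumption: $rows stands in for the rows fetched from the database.
$rows = [
    [
        'post_date_gmt' => '2013-01-01 12:00:00',
        'post_title'    => 'Foo & Bar',
        'post_content'  => '<p>Some <b>HTML</b> content</p>',
    ],
];

$xml = new XMLWriter();
$xml->openMemory();
$xml->setIndent(true);
$xml->startDocument();
$xml->startElement('BlogPosts');

foreach ($rows as $row) {
    $xml->startElement('Post');

    // writeElement() creates the element and escapes its text content.
    $xml->writeElement('PostDate', $row['post_date_gmt']);
    $xml->writeElement('PostTitle', $row['post_title']);

    // The element is created separately; the CDATA section goes inside it.
    $xml->startElement('PostContent');
    $xml->writeCData($row['post_content']);
    $xml->endElement(); // PostContent

    $xml->endElement(); // Post
}

$xml->endElement(); // BlogPosts
$xml->endDocument();

$output = $xml->outputMemory();
echo $output;
```

When writing to php://output as in the question, the Content-Type header has to be sent before the first bytes are written, not just before flush().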
You can just add it to the elements you need wrapped with CDATA like this:
$xml->writeRaw('<![CDATA['.$row['post_date_gmt'].']]>');
The answer by ThW is overall thoughtful and the way to go. It explains well how the interface of XMLWriter in PHP is meant to be used.
Credits go to him as well for a large fraction of the work done for this differentiated answer as we discussed the question yesterday in chat.
There are some constraints on CDATA in XML, however, that also apply to the two outlined ways of using XMLWriter for CDATA:
The string ']]>' cannot be placed within a CDATA section, therefore, nested CDATA sections are not allowed (well-formedness constraint).
From: CDATA Section - compare 2.7 CDATA Sections
Normally XMLWriter accepts string data that is not yet encoded for its use. E.g. if you pass some text, it will get written properly encoded (unless you use the aforementioned XMLWriter::writeRaw).
But if you start a CDATA section and then write text, or you write CDATA directly, the string passed must neither end nor contain another CDATA section. That means it cannot contain the character sequence "]]>", as this would end the CDATA section prematurely.
So the responsibility to pass valid data to XMLWriter remains with the user of those methods.
It is normally trivial to do so (for single-octet, US-ASCII-based character encodings and for UTF-8); here is some example code:
/**
* prepare text for CDATA section to prevent invalid or nested CDATA
*
* @param $string
*
* @return string
* @link http://www.w3.org/TR/REC-xml/#sec-cdata-sect
*/
function xmlwriter_prepare_cdata_text($string) {
return str_replace(']]>', ']]]]><![CDATA[>', (string) $string);
}
And a usage example:
$xml = new XMLWriter();
$xml->openURI("php://output");
$xml->startDocument();
$xml->startElement("PostContent");
$xml->writeCDATA(xmlwriter_prepare_cdata_text('<![CDATA[Foo & Bar]]>'));
$xml->endElement();
$xml->endDocument();
Example output:
<?xml version="1.0"?>
<PostContent><![CDATA[<![CDATA[Foo & Bar]]]]><![CDATA[>]]></PostContent>
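One way to convince yourself that the split CDATA sections still carry the original string is to parse the output again and compare the element's text content. The following is a small round-trip sketch under that assumption; it repeats the helper from above so that it runs on its own and uses openMemory() so the result can be fed to DOMDocument.

```php
<?php
// Same helper as above, repeated so this sketch is self-contained.
function xmlwriter_prepare_cdata_text($string) {
    return str_replace(']]>', ']]]]><![CDATA[>', (string) $string);
}

$original = '<![CDATA[Foo & Bar]]>';

$xml = new XMLWriter();
$xml->openMemory();
$xml->startDocument();
$xml->startElement('PostContent');
$xml->writeCData(xmlwriter_prepare_cdata_text($original));
$xml->endElement();
$xml->endDocument();

// Parse the generated document; textContent concatenates the text of
// all child nodes, including the two adjacent CDATA sections.
$dom = new DOMDocument();
$dom->loadXML($xml->outputMemory());
$roundTripped = $dom->documentElement->textContent;

var_dump($roundTripped === $original); // bool(true)
```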
DOMDocument, by the way, already does something very similar under the hood:
$dom = new DOMDocument();
$dom->appendChild(
$dom->createElement('PostContent')
);
$dom->documentElement->appendChild(
$dom->createCdataSection('<![CDATA[Foo & Bar]]>')
);
$dom->save("php://output");
Output:
<?xml version="1.0"?>
<PostContent><![CDATA[<![CDATA[Foo & Bar]]]]><![CDATA[>]]></PostContent>
To understand technically why XMLWriter in PHP behaves this way, you need to know that XMLWriter is based on the libxml2 library. For most of its work, the PHP extension simply passes the calls through to libxml:
PHP's xmlwriter_write_cdata delegates to libxml's xmlTextWriterWriteCDATA, which performs the expected sequence of xmlTextWriterStartCDATA, xmlTextWriterWriteString and xmlTextWriterEndCDATA.
xmlTextWriterWriteString is used in many routines (e.g. writing a PI), but the content parameter string is encoded only in some text-writing cases.
For all others, it's passed as-is. This includes CDATA, so the data passed to XMLWriter::writeCData must match the requirements for XML CData (because that is what the method writes):
CData ::= (Char* - (Char* ']]>' Char*))
Which technically says: any string not containing "]]>".
This is easily overlooked; I myself suspected yesterday that it could be a bug. And I'm not the only one: a related bug report on PHP.net from years ago is https://bugs.php.net/bug.php?id=44619.
See also: What does <![CDATA[]]> in XML mean?