I need to parse an XML document in PHP that I receive from a third party. I am not able to ask the maintainers of the document to fix its structure. When I parse the document using simplexml_load_file, the resulting XML document appears to be empty.
Here is a stripped down example of what I am seeing.
my-file.xml:
<?xml version="1.0" encoding="utf-8"?>
<DataSet>
<diffgr:diffgram xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
aaa
</diffgr:diffgram>
</DataSet>
And I process it like this (from the command line):
php > $xml = simplexml_load_file('my-file.xml');
php > print_r($xml);
SimpleXMLElement Object
(
)
I was expecting print_r to display the XML structure.
Indeed, when I remove the namespace declaration, things seem to work (despite some expected XML parse warnings):
my-file-nonamespace.xml:
<?xml version="1.0" encoding="utf-8"?>
<DataSet>
<diffgr:diffgram>
aaa
</diffgr:diffgram>
</DataSet>
Processing it the same way on the command line (with warnings removed):
php > $xml = simplexml_load_file('my-file-nonamespace.xml');
// a bunch of xml parse warnings
php > print_r($xml);
SimpleXMLElement Object
(
[diffgr:diffgram] =>
aaa
)
So, the problem has to do with an invalid namespace declaration. I can probably use a regular expression on the file to remove the namespace declaration before parsing, but that is not a direction I want to go.
What is the best way to properly parse the first document in PHP?
The data is being loaded. The issue is that the child element is in a different namespace, so print_r and plain property access on the root element do not show it.
$xml = simplexml_load_file('my-file.xml');
var_dump($xml->children("diffgr", true));
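This selects the children of the current element that belong to a specific namespace. The second argument (true) tells children() that the first argument is a prefix rather than a namespace URI.
Note that in real code you should pass the namespace URI instead, as the prefix may change between documents. The prefix is used here only to show that the data is there.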
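As a minimal sketch of the URI-based lookup, assuming the same sample document as in the question: the XML is inlined with simplexml_load_string so the snippet is self-contained, and the variable names ($source, $ns, $text) are only illustrative.

```php
<?php
// Same document as my-file.xml, inlined so the example runs on its own.
$source = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<DataSet>
<diffgr:diffgram xmlns:diffgr="urn:schemas-microsoft-com:xml-diffgram-v1">
aaa
</diffgr:diffgram>
</DataSet>
XML;

$xml = simplexml_load_string($source);

// Namespace URI copied from the xmlns:diffgr declaration in the document.
$ns = 'urn:schemas-microsoft-com:xml-diffgram-v1';

// Without a second argument, children() treats its argument as a URI.
$children = $xml->children($ns);

// Element names are used without the prefix once the namespace is selected.
$text = trim((string) $children->diffgram);

echo $text, PHP_EOL; // prints "aaa"
```

Because the lookup is keyed on the URI, it should keep working if the third party later changes the prefix from diffgr to something else.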
Edit: If the XML itself has issues, the first step is to suppress the parse errors and then check what actually gets loaded:
libxml_use_internal_errors(true);
$xml = simplexml_load_file('my-file.xml');
echo $xml->asXML();
This shows whether the document loads at all and what state the result is in. A quick example is:
libxml_use_internal_errors(true);
$xml = simplexml_load_file('my-file.xml');
echo $xml->asXML();
var_dump($xml->children());
With this input:
<?xml version="1.0" encoding="utf-8"?>
<DataSet>
<diffgr:diffgram>
aaa
</diffgr:diffgram>
</DataSet>
Notice how the diffgr prefix is used, but the namespace itself isn't declared. The output is:
<?xml version="1.0" encoding="utf-8"?>
<DataSet>
<diffgr:diffgram>
aaa
</diffgr:diffgram>
</DataSet>
/home/nigel/workspace2/Test/t1.php:22:
class SimpleXMLElement#2 (1) {
public $diffgr:diffgram =>
string(11) "
aaa
"
}
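This outputs the children without having to specify a namespace: since the prefix is not bound to any namespace, the element shows up under the literal name diffgr:diffgram.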
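The errors suppressed by libxml_use_internal_errors(true) are not lost. They are collected and can be read back with libxml_get_errors(). Here is a rough sketch of that check, using the undeclared-prefix document inlined as a string; the $messages and $loaded variables are just illustrative names, and the exact message text depends on the libxml version.

```php
<?php
// Document with the diffgr prefix used but never declared.
$source = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<DataSet>
<diffgr:diffgram>
aaa
</diffgr:diffgram>
</DataSet>
XML;

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$xml = simplexml_load_string($source);

// Each entry is a LibXMLError object with message, line, column and level.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
// Empty the buffer so later parses start clean.
libxml_clear_errors();

// simplexml_load_string returns false only when nothing usable was parsed.
$loaded = ($xml !== false);

echo $loaded ? "loaded\n" : "failed to load\n";
echo implode("\n", $messages), "\n";
```

Checking $loaded separately from $messages distinguishes a document that failed to parse from one that parsed with recoverable problems, which is the case here.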