All the links on the web page http://php.net
can be extracted with simplexml_import_dom, as shown in code1.
code1
<?php
$dom = new DOMDocument();
$dom->loadHTMLFile('http://php.net');
$xml = simplexml_import_dom($dom);
$nodes = $xml->xpath('//a[@href]');
foreach ($nodes as $node) {
echo $node['href'], "<br />\n";
}
?>
Now I want to parse the web page with DOMXPath instead. I replaced simplexml_import_dom in code1 with DOMXPath in code2, but there is a bug in code2. How can I fix it?
code2
<?php
$html = file_get_contents('http://php.net');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@href]');
foreach ($nodes as $node) {
echo $node['href'], "<br />\n";
}
?>
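The data returned by DOMXPath::query() is a DOMNodeList of objects (DOMElement), not arrays, so $node['href'] does not work. Use $node->getAttribute("href") instead.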
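A minimal sketch of the difference, assuming PHP 7 or later (where the array-style access throws a catchable Error) and using a hypothetical inline HTML string instead of fetching http://php.net:

```php
<?php
// Hypothetical inline markup so the example runs without network access.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><a href="/manual/">Manual</a></body></html>');

$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@href]'); // DOMNodeList, not an array
$node  = $nodes->item(0);             // DOMElement, not an array

$arrayAccessFailed = false;
try {
    // DOMElement does not implement ArrayAccess, so this throws an Error.
    $href = $node['href'];
} catch (Error $e) {
    $arrayAccessFailed = true;
}

// Attributes of a DOMElement are read with getAttribute().
$href = $node->getAttribute('href');
echo get_class($nodes), ' / ', get_class($node), ' / ', $href, "\n";
```

SimpleXMLElement supports the $node['href'] syntax, which is why code1 works while code2 does not.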
If you get a warning like:
Warning: DOMDocument::loadHTML(): Tag nav invalid in Entity
it is because the document uses HTML5 tags that libxml does not recognize. To keep it out of the output, add this line before the loadHTML() call:
libxml_use_internal_errors(true);
Corrected code:
$html = file_get_contents('http://php.net');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[@href]');
foreach ($nodes as $node) {
echo $node->getAttribute("href"), "<br />\n";
}
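The two pieces of advice above can be combined. The following is only a sketch: the inline HTML string is a hypothetical stand-in for the downloaded page, and whether libxml reports anything for the nav tag depends on the libxml version installed.

```php
<?php
// Hypothetical inline markup containing an HTML5 tag (nav).
$html = '<html><body><nav>'
      . '<a href="/docs.php">Docs</a> <a>no link</a> <a href="/downloads.php">Downloads</a>'
      . '</nav></body></html>';

// Collect parse warnings internally instead of printing them;
// the function returns the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

$errors = libxml_get_errors(); // array of LibXMLError objects, possibly empty
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$hrefs = [];
foreach ($xpath->query('//a[@href]') as $node) {
    $hrefs[] = $node->getAttribute('href'); // the a tag without href is skipped
}

echo implode("\n", $hrefs), "\n";
echo count($errors), " parse message(s) collected\n";
```

Restoring the previous setting afterwards keeps the suppression from hiding libxml errors in unrelated code.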
To extract the href attribute of every a tag directly:
$sHtml = file_get_contents('http://php.net');
// var_dump( $sHtml );
$oDom = new DOMDocument( '1.0', 'utf-8' );
// Suppress libxml parse warnings (e.g. for HTML5 tags)
libxml_use_internal_errors(true);
$oDom->loadHTML('<?xml encoding="UTF-8">' . $sHtml );
// var_dump( $oDom );
$oXPath = new DOMXPath( $oDom );
$oNodes = $oXPath->query( '//a/@href' );
foreach( $oNodes as $oNode )
{
// var_dump( $oNode );
echo $oNode->nodeValue, "<br />\n";
}
// Restore default libxml error handling
libxml_use_internal_errors(false);
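The query '//a/@href' selects the attribute nodes themselves, while '//a[@href]' selects the a elements that have such an attribute. A small sketch of the two styles side by side, using a hypothetical inline HTML string rather than the live page:

```php
<?php
// Hypothetical inline markup so the example runs without network access.
$html = '<html><body><a href="/a.php">A</a><a href="/b.php">B</a></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Style 1: select elements, then read the attribute from each DOMElement.
$fromElements = [];
foreach ($xpath->query('//a[@href]') as $element) {
    $fromElements[] = $element->getAttribute('href');
}

// Style 2: select the attribute nodes (DOMAttr) and read their value.
$fromAttributes = [];
foreach ($xpath->query('//a/@href') as $attribute) {
    $fromAttributes[] = $attribute->nodeValue;
}

var_dump($fromElements === $fromAttributes);
```

Both loops collect the same list here; the element form is the more convenient one when other attributes or the link text are needed as well.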