Consider this HTML, littered with arbitrary whitespace and irrelevant tags like div and span:
<div>
<span><a href="#1">Title 1</a></span>
<p>Paragraph 2</p>
<p>Outside 3
<a href="#4">Title 4</a>
</p>
</div>
How can I convert this into a PHP array of link and text pairs, in the same order as in the HTML?
{"#1", "Title 1" },
{null, "Paragraph 2"},
{null, "Outside 3" },
{"#4", "Title 4" },
The problem is that DOM searches like $html->find("a, p") will capture 4 twice: once by itself and once inside 3.
I'm wondering if the solution is to traverse the document "linearly", as a human would read it, element by element from left to right, and whenever a node has text, pick up the parent node's href, if any.
If this is viable, how do you easily go through the DOM like this? Does anyone have a solution, preferably using Simple HTML DOM Parser or a simple regexp, or alternatively something built into PHP?
I would look at https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator, which will help you recursively traverse the DOM structure.
$dom = new DOMDocument();
// Wrap the fragment in <html></html> so the parser sees a single root element.
$dom->loadHTML('<html>'.$htmlString.'</html>');
// RecursiveDOMIterator is not built in; include the class from the link above.
$dit = new RecursiveIteratorIterator(new RecursiveDOMIterator($dom));
$result = array();
foreach ($dit as $node) {
    // Only non-empty leaf nodes (in practice, text nodes) are of interest.
    if (trim($node->nodeValue) == "" || $node->hasChildNodes()) {
        continue;
    }
    // Default the href to null so every pair has both slots.
    $r = array(null, trim($node->nodeValue));
    $parent = $node->parentNode;
    if ($parent->nodeName == 'a') {
        $r[0] = $parent->getAttribute('href');
    }
    $result[] = $r;
}
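If pulling in RecursiveDOMIterator is not an option, the same left-to-right walk can probably be done with the DOM extension that ships with PHP. The following is only a sketch: the function name extractLinkTextPairs is made up for illustration, and it assumes that the nearest enclosing a element with an href is the link you want for each piece of text. XPath returns text() nodes in document order, which is what gives the "linear" reading.

```php
<?php
// Hypothetical helper (not part of any library): collect [href, text]
// pairs in document order using only PHP's bundled DOM extension.
function extractLinkTextPairs(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // keep sloppy markup from emitting warnings
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $pairs = [];

    // text() nodes are returned in document order; normalize-space()
    // filters out the whitespace-only nodes between tags.
    foreach ($xpath->query('//text()[normalize-space()]') as $text) {
        // Nearest enclosing <a href>, so <a><b>Title</b></a> still finds the link.
        $link = $xpath->query('ancestor::a[@href][1]', $text)->item(0);
        $pairs[] = [
            $link ? $link->getAttribute('href') : null,
            trim($text->nodeValue),
        ];
    }

    return $pairs;
}
```

Called on the HTML from the question, this should yield the four pairs in the requested order, with null in the first slot for the text that is not inside a link.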
I found a non-DOM approach by pondering my own neatly organised example! By splitting each tag into lines, I can easily extract the information I want. It's maybe not "correct", but works as intended!
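$result = array();
$array = preg_split("#(?=<)#", $html, 0, PREG_SPLIT_NO_EMPTY);
foreach ($array as $item) {
    // Keep only pieces that have text after the tag; href is null when absent.
    if (preg_match('/>\s*(\S.*)/', $item, $m)) {
        $href = preg_match('/href="([^"]*)/', $item, $n) ? $n[1] : null;
        $result[] = array($href, trim($m[1]));
    }
}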