Parsing text and link pairs from HTML into a PHP array that preserves their order

Consider this HTML, littered with arbitrary whitespace and irrelevant tags such as div and span:

<div>
<span><a href="#1">Title 1</a></span>
<p>Paragraph 2</p>
<p>Outside 3
<a href="#4">Title 4</a>
</p>
</div>

How can I convert this into a PHP array of link and text pairs, in the same order as in the HTML?

array("#1", "Title 1"    ),
array(null, "Paragraph 2"),
array(null, "Outside 3"  ),
array("#4", "Title 4"    ),

The problem is that DOM searches like $html->find("a, p") will capture "Title 4" twice: once as the link itself and once as part of the paragraph that starts with "Outside 3".

I'm wondering if the solution is to traverse the document "linearly", as a human would read it, element by element from left to right, and whenever a node has text, pick up the parent node's href, if any.

If this is viable, how do you easily go through the DOM like this? Does anyone have a solution, preferably with Simple HTML DOM Parser or a simple regexp, or alternatively with something built into PHP?

I would look at RecursiveDOMIterator (https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator), which will help you recursively traverse the DOM structure.

// RecursiveDOMIterator is not built into PHP; include the class from the page linked above.
$dom = new DOMDocument();
$dom->loadHTML('<html>'.$htmlString.'</html>'); // wrap the fragment in <html></html> so it is parsed as one document
$dit = new RecursiveIteratorIterator(new RecursiveDOMIterator($dom));
$result = array();
foreach ($dit as $node) {
    // keep only text nodes that contain more than whitespace
    if ($node->nodeType !== XML_TEXT_NODE || trim($node->nodeValue) === "") {
        continue;
    }
    $r = array(null, trim($node->nodeValue)); // href stays null for text outside a link
    $parent = $node->parentNode;
    if ($parent->nodeName == 'a') {
        $r[0] = $parent->getAttribute('href');
    }
    $result[] = $r;
}
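
If adding the RecursiveDOMIterator class is not an option, the same "linear" walk can probably be done with the built-in DOM extension alone. The following is only a sketch: extractPairs() is a made-up helper name, and it assumes that every non-blank text node should become one row and that the nearest enclosing a element supplies the href. An XPath text() query returns nodes in document order, so the rows come out in the order they appear in the HTML.

```php
<?php
// Sketch only: extractPairs() is a hypothetical helper, not part of any library.
function extractPairs(string $html): array
{
    $dom = new DOMDocument();
    // Keep libxml quiet about sloppy markup while parsing, then restore the setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $pairs = [];
    // Text nodes are returned in document order; whitespace-only nodes are filtered out.
    foreach ($xpath->query('//text()[normalize-space()]') as $text) {
        // Nearest enclosing <a href>, if any; also covers nesting like <a><b>text</b></a>.
        $link = $xpath->query('ancestor::a[@href][1]', $text)->item(0);
        $pairs[] = [
            $link ? $link->getAttribute('href') : null,
            trim($text->nodeValue),
        ];
    }
    return $pairs;
}
```

Called on the HTML from the question, extractPairs($html) should give the four rows asked for, with null as the link for "Paragraph 2" and "Outside 3". Unlike the parentNode check above, the ancestor lookup also finds the link when the text is wrapped in another tag inside the a element.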

I found a non-DOM approach by pondering my own neatly organised example! By splitting the HTML in front of every tag, each piece holds one tag plus the text that follows it, so I can easily extract the information I want. It's maybe not "correct", but it works as intended!

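// Split in front of every "<", so each piece is one tag plus the text that follows it
$array = preg_split("#(?=<)#", $html, -1, PREG_SPLIT_NO_EMPTY);

$result = array();
foreach ($array as $item) {
    // Text after the tag on the same line; skip pieces without any
    if (!preg_match('/>\s*(\S.*)/', $item, $m)) {
        continue;
    }
    // href of the tag that opens this piece, or null if it has none
    $href = preg_match('/href="([^"]*)/', $item, $n) ? $n[1] : null;
    $result[] = array($href, rtrim($m[1]));
}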