PHP: split a string of HTML into an array keyed by tag class name

I need to take a string of html text like:

<p>This is a line with no spans<br>
This is a line <span class="second">This is secondary</span><br>  
This is another line <span class="third">And this is third</span> <span class="four">this is four</span></p>

And have it end up as an array in PHP like:

array(
    "This is a line with no spans",
    array(
      "This is a line",
      "second" => "This is secondary",
    ),
    array(
      "This is another line",
      "third" => "And this is third",
      "four" => "this is four"
    )
);

Getting each line into its own value was easy: I just split the text on <br>, and that works fine. What I can't quite get is splitting each line by class name. I feel like PHP's preg_split may hold the key, but I kind of suck with regular expressions and I can't get it figured out.

Any ideas?

It's not a good idea to use regular expressions to parse HTML (cite). It's just not a suitable tool; see @JAAulde's answer.

The best way is to do it purely with the DOM. Loop through all child nodes (including text nodes) to format the array the way you want. Like this:

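// $p is the DOMElement for the <p> tag, e.g. obtained with
// $doc->getElementsByTagName('p')->item(0) after $doc->loadHTML($html).
$lines = array();
$line = array(); // parts of the line currently being built
$pChildren = $p->childNodes;
for ($i = 0; $i < $pChildren->length; $i++) {
    $child = $pChildren->item($i);
    if ($child instanceof DOMText) {
        // Skip whitespace-only text (e.g. the space between two spans).
        $text = trim($child->wholeText);
        if ($text !== '') {
            $line[] = $text;
        }
    } elseif ($child instanceof DOMElement) {
        if (strtolower($child->tagName) == 'br') {
            // A <br> ends the current line.
            $lines[] = $line;
            $line = array();
        } elseif (strtolower($child->tagName) == 'span' && $child->hasAttribute('class')) {
            // Key the span's text by its class name.
            $line[$child->getAttribute('class')] = $child->nodeValue;
        }
    }
}
// Keep the last line, which has no trailing <br>.
if ($line) {
    $lines[] = $line;
}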

Warning: treat the above as pseudo-code. It has not been tested at all; I'm just going from experience and the manual.

Maybe you can use an XML parser? Here's the doc.

You should not attempt to parse HTML with regex or other means. It is very complicated, and you will end up with terrible maintenance problems.

I highly recommend you look into how to read a chunk of markup into a DOM document [docs] and then use DOM methods to work with it just like you would browser side.
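
As a rough illustration of that advice, here is a minimal sketch that loads the question's markup into a DOMDocument and lists the children of the first paragraph. The helper names firstParagraph and describeChildren are made up for this example, and the sketch assumes PHP 7.1+ with the bundled DOM extension enabled; it has only been checked against the sample markup from the question.

```php
<?php
// Hypothetical helper: parse an HTML fragment and return its first <p> element,
// or null if there is none.
function firstParagraph(string $html): ?DOMElement
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $p = $doc->getElementsByTagName('p')->item(0);
    return $p instanceof DOMElement ? $p : null;
}

// Hypothetical helper: describe each child node of the paragraph.
// Non-blank text nodes become "text"; elements become their tag name,
// followed by ".class" when a class attribute is present.
function describeChildren(DOMElement $p): array
{
    $out = [];
    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMText) {
            if (trim($child->nodeValue) !== '') {
                $out[] = 'text';
            }
        } elseif ($child instanceof DOMElement) {
            $name  = strtolower($child->tagName);
            $class = $child->getAttribute('class'); // '' when the attribute is absent
            $out[] = $class === '' ? $name : $name . '.' . $class;
        }
    }
    return $out;
}

$html = '<p>This is a line with no spans<br>'
      . 'This is a line <span class="second">This is secondary</span><br>'
      . 'This is another line <span class="third">And this is third</span> '
      . '<span class="four">this is four</span></p>';

$p = firstParagraph($html);
if ($p !== null) {
    print_r(describeChildren($p));
}
```

Once the paragraph is available as a DOMElement, its text nodes, br elements and span elements can be walked in document order, which is the information needed to build the nested array from the question.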