I need to take a string of HTML text like:
<p>This is a line with no spans<br>
This is a line <span class="second">This is secondary</span><br>
This is another line <span class="third">And this is third</span> <span class="four">this is four</span></p>
And have it end up as an array in PHP like:
array(
"This is a line with no spans",
array(
"This is a line",
"second" => "This is secondary",
),
array(
"This is another line",
"third" => "And this is third",
"four" => "this is four"
)
);
Getting each line into its own value was easy: I just split the text on <br>, and that works fine. What I can't quite get is splitting each line further so that the span contents are keyed by class name. I feel like PHP's preg_split may hold the key, but I'm not good with regular expressions and I can't figure it out.
Any ideas?
It's not a good idea to use regular expressions to parse HTML (cite). It's just not a suitable tool; see @JAAulde's answer.
The best way is to do it purely with the DOM. Loop through all child nodes (including text nodes) to format the array the way you want. Like this:
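// Assumes $p already holds the paragraph's DOMElement (get paragraph tag...).
$lines = array();
$line = array(); // current line; initialised once, outside the loop
foreach ($p->childNodes as $child) {
    if ($child instanceof DOMText) {
        // Skip whitespace-only text nodes (e.g. the space between two spans).
        $text = trim($child->wholeText);
        if ($text !== '') {
            $line[] = $text;
        }
    } elseif ($child instanceof DOMElement) {
        $tag = strtolower($child->tagName);
        if ($tag == 'br') {
            // A <br> ends the current line.
            $lines[] = $line;
            $line = array();
        } elseif ($tag == 'span' && $child->hasAttribute('class')) {
            $line[$child->getAttribute('class')] = $child->nodeValue;
        }
    }
}
// The last line has no trailing <br>, so add it after the loop.
if ($line) {
    $lines[] = $line;
}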
Warning: treat the above as pseudo-code. It has not been tested at all; I'm just going from experience and the manual.
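For anyone who wants something closer to runnable, here is a minimal sketch of the same loop wrapped in a function that also loads the markup. The function name paragraphToLines is made up for illustration, and collapsing a span-less line to a plain string is my reading of the output asked for in the question, not something the approach above requires.

```php
<?php
// Hypothetical helper: turn a <p> with <br>-separated lines into a nested array.
function paragraphToLines(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $p = $doc->getElementsByTagName('p')->item(0);
    if ($p === null) {
        return [];
    }

    $lines = [];
    $line = [];
    // Assumption: a line with only plain text is stored as a string, not an array.
    $flush = function () use (&$lines, &$line) {
        if ($line) {
            $lines[] = (count($line) === 1 && isset($line[0])) ? $line[0] : $line;
        }
        $line = [];
    };

    foreach ($p->childNodes as $child) {
        if ($child instanceof DOMText) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                $line[] = $text;
            }
        } elseif ($child instanceof DOMElement) {
            $tag = strtolower($child->tagName);
            if ($tag === 'br') {
                $flush(); // <br> ends the current line
            } elseif ($tag === 'span' && $child->hasAttribute('class')) {
                $line[$child->getAttribute('class')] = $child->textContent;
            }
        }
    }
    $flush(); // the last line has no trailing <br>

    return $lines;
}

$html = <<<HTML
<p>This is a line with no spans<br>
This is a line <span class="second">This is secondary</span><br>
This is another line <span class="third">And this is third</span> <span class="four">this is four</span></p>
HTML;

print_r(paragraphToLines($html));
```

For the sample markup in the question, this is intended to produce the nested array asked for. Note that a span with several classes would be keyed by the whole class attribute string.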
Maybe you can use an XML parser? Here's the doc.
You should not attempt to parse HTML with regular expressions or other ad hoc string handling. HTML is too complicated for that, and you will end up with terrible maintenance problems.
I highly recommend you look into how to read a chunk of markup into a DOM document [docs] and then use DOM methods to work with it, just as you would in the browser.
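As a rough illustration of what that looks like in PHP, here is a small sketch that loads a fragment with DOMDocument and then queries it with DOMXPath. The variable names and the sample markup are just for the example, and keying the result by class name is an assumption based on what the question asks for.

```php
<?php
$markup = '<p>This is a line <span class="second">This is secondary</span></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($markup);          // wraps the fragment in <html><body> as needed
libxml_clear_errors();

// Find every <span> that has a class attribute, much like a selector query in the browser.
$xpath = new DOMXPath($doc);
$byClass = [];
foreach ($xpath->query('//span[@class]') as $span) {
    $byClass[$span->getAttribute('class')] = $span->textContent;
}

print_r($byClass);
```

Once the markup is in a DOMDocument, the usual DOM methods (getElementsByTagName, getAttribute, childNodes, and so on) are available, so there is no need for string splitting at all.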