I have the following input
<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>
I know title1 and title2, and I want to collect content1 and content2.
I would need something like this:
<div style="s1">title1</div>.*?<div style="s1">(.*?)</div>
but since the regexp is greedy, it matches up to the end, so it returns:
content1</div>
<div style="s1">title2</div>
<div style="s1">content2
I would like to add to the pattern a list of tags that should not be included in the match.
Something like:
<div style="s1">title1</div>.*?<div style="s1">(.*?[^<div])</div>
where [^<div] is meant to express "must not contain this tag". It should accept several alternatives, probably separated with |
How can I do it?
Now that that is out of the way, just do some DOM manipulation and XPath:
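$dom = new DOMDocument();
@$dom->loadHTML($html); // @ silences warnings about malformed markup
$x = new DOMXPath($dom);
$content = array();
foreach ($x->query("//div") as $node)
{
    if (trim($node->textContent) == 'title1')
    {
        // nextSibling can be the whitespace text node between two tags,
        // so select the next <div> element explicitly
        $next = $x->query("following-sibling::div[1]", $node)->item(0);
        if ($next)
        {
            $content['title1'] = $next->textContent;
        }
    }
}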
Now wasn't that easy? So no more regexing HTML, okay?
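Since the question asks for both content1 and content2, here is a minimal sketch of the same DOM/XPath idea applied to a list of known titles. The function name collectContents and the sample $html are my own placeholders, not part of the answer above, and the sketch assumes each content div is the next div sibling of its title div.

```php
<?php
// Hypothetical helper: maps each known title to the text of the <div>
// element that follows it. Assumes title and content divs are siblings.
function collectContents(string $html, array $titles): array
{
    $dom = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $content = [];
    foreach ($xpath->query('//div') as $node) {
        $title = trim($node->textContent);
        if (!in_array($title, $titles, true)) {
            continue;
        }
        // following-sibling::div[1] skips whitespace text nodes between tags.
        $next = $xpath->query('following-sibling::div[1]', $node)->item(0);
        if ($next !== null) {
            $content[$title] = trim($next->textContent);
        }
    }
    return $content;
}

$html = <<<HTML
<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>
HTML;

print_r(collectContents($html, ['title1', 'title2']));
// Array ( [title1] => content1 [title2] => content2 )
```

Titles that do not occur in the markup are simply left out of the returned array.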
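<div style="s1">title1</div>.*?<div style="s1">(([^<]|<[^\/])*)</div>
Try this (with the s modifier, so that . also matches the line breaks between the tags): the capture group matches any character other than <, or a < that is not followed by /, so it stops at the first closing tag. If you want, I can add a condition for sub-divs etc.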
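To see how a pattern of this shape behaves on the sample input, here is a minimal sketch using preg_match. The helper name contentAfterTitle is hypothetical, and the lazy .*? together with the s modifier are assumptions I made so that the pattern can cross the line breaks between the divs without skipping ahead to a later one.

```php
<?php
// Hypothetical helper: returns the text of the first <div style="s1">
// that follows the div holding $title, or null if there is no match.
function contentAfterTitle(string $html, string $title): ?string
{
    // (?:[^<]|<[^/])* consumes any character except "<", or a "<" that is
    // not followed by "/", so the capture ends at the first closing tag.
    $pattern = '~<div style="s1">' . preg_quote($title, '~') . '</div>'
             . '.*?<div style="s1">((?:[^<]|<[^/])*)</div>~s';

    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

$html = <<<HTML
<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>
HTML;

echo contentAfterTitle($html, 'title1'), "\n"; // content1
echo contentAfterTitle($html, 'title2'), "\n"; // content2
```

The title is passed through preg_quote so that any regex metacharacters in it are matched literally.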
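Just use the U (ungreedy) modifier: http://.php.net/manual/fr/reference.pcre.pattern.modifiers.php
Keep in mind that U inverts greediness, so under U a quantifier followed by ? becomes greedy again.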
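A small sketch of what the U modifier changes, run against the sample input from the question. The variable names are mine, and the s modifier is added as an assumption so that . can match the newlines between the divs.

```php
<?php
$html = <<<HTML
<div style="s1">title1</div>
<div style="s1">content1</div>
<div style="s1">title2</div>
<div style="s1">content2</div>
HTML;

$open = '<div style="s1">';

// Greedy quantifiers: the first .* runs as far as it can, so the capture
// group ends up on the last div.
preg_match('~' . $open . 'title1</div>.*' . $open . '(.*)</div>~s', $html, $greedy);

// Same pattern with U added: every quantifier becomes lazy by default.
preg_match('~' . $open . 'title1</div>.*' . $open . '(.*)</div>~sU', $html, $ungreedy);

// Equivalent without U: mark each quantifier lazy explicitly with ?.
preg_match('~' . $open . 'title1</div>.*?' . $open . '(.*?)</div>~s', $html, $lazy);

echo $greedy[1], "\n";   // content2
echo $ungreedy[1], "\n"; // content1
echo $lazy[1], "\n";     // content1
```

Mixing U with explicit ? quantifiers is easy to get backwards, so it is probably simplest to pick one style per pattern.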