I have something like this on my site:
<div class="latestItemIntroText">
<div class="itemLinks">
<div class="share">Share</div>
<div class="dummy-div"></div>
<div class="addthis_sharing_toolbox"></div>
</div>
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
</div>
I need to extract only the Lorem ipsum text. I tried a regex like this:
</div>([\s?]+[^<]+[<br?/?>]*[^<]+[<br?/?>]*[^<]+[<br?/?>]*[^<]+)</div>
I noticed that I repeat this part many times:
[^<]+[<br?/?>]*
I repeat it because I don't know how many times a br followed by Lorem ipsum will appear: maybe once, maybe 10 times. Is there a way to shorten this regex?
Using a regex on an HTML string is not a good approach; use DOMDocument for this instead.
<?php
ini_set('display_errors', 1);
$string = <<<HTML
<div class="latestItemIntroText">
<div class="itemLinks">
<div class="share">Share</div>
<div class="dummy-div"></div>
<div class="addthis_sharing_toolbox"></div>
</div>
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
</div>
HTML;
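$domDocument = new DOMDocument();
$domDocument->loadHTML($string);
$domXPath = new DOMXPath($domDocument);

// Find the share-links block and remove it, so its text ("Share") is not in the output.
// The matches are copied into an array first, so nothing is removed while iterating the node list.
$results = $domXPath->query('//div[@class="itemLinks"]');
$toRemove = iterator_to_array($results);
foreach ($toRemove as $removal) {
    $removal->parentNode->removeChild($removal);
}

// Only the plain text is left inside the wrapper div.
$results = $domXPath->query('//div[@class="latestItemIntroText"]');
echo $results->item(0)->textContent;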
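If you would rather not modify the document, a variation on the same idea is to select only the text nodes that are direct children of the wrapper div. This is a minimal sketch, assuming the markup looks like the sample in the question; the variable names ($html, $texts) are just illustrative.

```php
<?php
// Sample input copied from the question.
$html = <<<HTML
<div class="latestItemIntroText">
<div class="itemLinks">
<div class="share">Share</div>
<div class="dummy-div"></div>
<div class="addthis_sharing_toolbox"></div>
</div>
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects only direct text-node children of the wrapper div,
// so the text inside the nested itemLinks div is never visited.
$texts = [];
foreach ($xpath->query('//div[@class="latestItemIntroText"]/text()') as $node) {
    $text = trim($node->nodeValue);
    if ($text !== '') {
        // Skip the whitespace-only nodes that sit between tags.
        $texts[] = $text;
    }
}

echo implode("\n", $texts), "\n";
```

Each entry in $texts is one run of text between the br tags, so you can join them however you like.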
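This simpler regex works for your input, provided it runs in multiline mode so that ^ and $ match at every line. All the usual caveats about the million different ways it could break still apply.
^(?!.*(?:<div|</div>))(.+?)(?=<br\s?/>|$)
The negative lookahead skips any line that contains <div or </div>; the lazy group then captures everything up to the first <br/> (or <br />) or EOL, via positive lookahead.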
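To try the pattern above in PHP, something like the following should do. It is only a sketch: the ~ delimiter, the m modifier (needed so that ^ and $ match at each line) and the trim step are my additions, and the input is a shortened copy of the sample from the question.

```php
<?php
// Shortened sample input based on the question.
$html = <<<HTML
<div class="latestItemIntroText">
<div class="itemLinks">
<div class="share">Share</div>
</div>
Lorem ipsum <br /><br />
Lorem ipsum <br /><br />
</div>
HTML;

// "m" makes ^ and $ match at the start and end of every line.
$pattern = '~^(?!.*(?:<div|</div>))(.+?)(?=<br\s?/>|$)~m';

preg_match_all($pattern, $html, $matches);

// Group 1 holds the captured text; trim the space left before each <br />.
$lines = array_map('trim', $matches[1]);

echo implode("\n", $lines), "\n";
```

Note that this works line by line, so it relies on each Lorem ipsum chunk starting on its own line and on the div tags never sharing a line with the text.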