I'm trying to parse some HTML with PHP as an exercise, outputting it as just text, and I've hit a snag. I'd like to remove any tags that are hidden with style="display: none;", bearing in mind that the tag may contain other attributes and style properties.
The code I have so far is this:
$page = preg_replace("#<([a-z]+).*?style=\".*?display:\s*none[^>]*>.*?</\1>#s","",$page);
This returns NULL with a PREG_BACKTRACK_LIMIT_ERROR.
I tried this instead:
$page = preg_replace("#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s","",$page);
But now it's just not replacing any tags.
Any help would be much appreciated. Thanks!
Using DOMDocument, you can try something like this:
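$doc = new DOMDocument;
$doc->loadHTMLFile("foo.html");
// Copy the live node list into an array so that removals don't skip elements.
$nodes = iterator_to_array($doc->getElementsByTagName('*'));
foreach ($nodes as $node) {
    // Match "display: none" with or without whitespace, in any letter case.
    if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
        // A node can only be removed through its own parent.
        $node->parentNode->removeChild($node);
    }
}
$doc->saveHTMLFile("foo.html");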
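Since the question starts from an HTML string in $page and wants plain text out, a variant of the same idea might look like the sketch below. The function name visibleText and the sample markup are made up for illustration, and the sketch only looks at inline style attributes: elements hidden through a stylesheet or a class are not detected.

```php
<?php
// Hypothetical helper (not from the question): strips elements hidden with an
// inline "display: none" and returns the remaining text.
function visibleText(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world markup is rarely perfectly valid.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Only elements that carry a style attribute can match.
    $xpath = new DOMXPath($doc);
    $hidden = [];
    foreach ($xpath->query('//*[@style]') as $node) {
        if (preg_match('/display\s*:\s*none/i', $node->getAttribute('style'))) {
            $hidden[] = $node;
        }
    }

    // Remove after collecting, so the tree is not modified while iterating.
    foreach ($hidden as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    $root = $doc->documentElement;
    return $root === null ? '' : trim($root->textContent);
}

// Assumed sample input, for illustration only.
$page = '<p>Visible</p>'
      . '<div class="x" style="color: red; display:none">Hidden <b>stuff</b></div>'
      . '<p>Also visible</p>';
$text = visibleText($page);
```

Here $text should contain the text of the two paragraphs but nothing from the hidden div, including its nested b element.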
You should never parse HTML with regex. That way lies bleeding eyes: HTML is not a regular language, so a pattern cannot reliably handle nested or sloppy markup. It should be parsed with a DOM parser.
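As a side note on the regex itself: the second pattern in the question probably fails for a reason unrelated to HTML. Inside a double-quoted PHP string, \1 is an octal escape, so the regex engine receives the character chr(1) rather than a backreference, and the closing tag never matches. A small sketch, using made-up sample markup, that compares the double-quoted pattern with a single-quoted one:

```php
<?php
// Assumed sample input, for illustration only.
$page = '<p>Visible</p>'
      . '<div class="x" style="color: red; display: none;">Hidden</div>'
      . '<p>Also visible</p>';

// Double quotes: PHP turns \1 into the single character chr(1)
// before preg_replace ever sees the pattern.
$broken = "#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s";

// Single quotes: \1 reaches the regex engine intact as a backreference.
$fixed = '#<([a-z]+)[^>]*?style="[^"]*?display:\s*none[^>]*>.*?</\1>#s';

$withBroken = preg_replace($broken, '', $page); // nothing is replaced
$withFixed  = preg_replace($fixed, '', $page);  // the hidden div is removed
```

Even with the backreference fixed, the pattern still stops at the first closing tag with the same name, so a hidden div that contains another div would be cut off too early. That is the kind of case where the DOM approach is the safer choice.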