A user will supply HTML, which may be valid or invalid (malformed). I need to be able to determine such things as:
I have tried using the DOMDocument class, but with XPath it can only do 1 and not 2.
I have also tried simple_html_dom and that can only do 1 but not 2.
Do you think it's a good idea to just use regular expressions, or is there something that I haven't thought of?
XPath can do both (1) and (2):
To test if there's a style tag in the body:
//body//style
To test if there's a div with a style attribute using width or background-image:
//div[contains(@style,'width:') or contains(@style,'background-image:')]
And, as you were curious about in your comments, seeing if a style tag contains a:hover or font-size:
//style[contains(text(),'a:hover') or contains(text(),'font-size:')]
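For what it's worth, here is a minimal sketch of running those three queries from PHP with DOMDocument and DOMXPath. The sample markup, the $checks labels and the $results array are made up for illustration; libxml_use_internal_errors(true) only keeps the parser's warnings about malformed input quiet.

```php
<?php
// Hypothetical, deliberately sloppy input: unclosed <div> and <p>, <style> inside <body>.
$html = '<html><body><div style="width:100px"><style>a:hover { color: red; }</style><p>Hi</body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The three queries from above, keyed by labels chosen for this example.
$checks = array(
    'style tag in body'            => "//body//style",
    'div with width or background' => "//div[contains(@style,'width:') or contains(@style,'background-image:')]",
    'style with a:hover or size'   => "//style[contains(text(),'a:hover') or contains(text(),'font-size:')]",
);

$results = array();
foreach ($checks as $label => $query) {
    // A query "passes" when it selects at least one node.
    $results[$label] = $xpath->query($query)->length > 0;
}

foreach ($results as $label => $found) {
    echo $label, ': ', ($found ? 'yes' : 'no'), "\n";
}
```

Keep in mind that contains() is a plain substring test, so a style written as "width : 100px" (space before the colon) would slip past 'width:'.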
Regex is NEVER (again: NEVER!) a solution for parsing HTML!
Regex can be used for Type-3 Chomsky languages (regular language).
HTML however is a Type-2 Chomsky language (context-free language).
If still in doubt: http://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy
To safely work with a Type-2 language you need a context-free language parser. You might want to try an LL parser or a recursive descent parser, e.g.
That being said:
Match body with style:
<body\s+[^>]*style\s*=\s*["'][^"']*["'][^>]*>
Match div with width|background-image in style:
<div\s+[^>]*style\s*=\s*["'][^"']*?(width|background-image)[^"']*?["'][^>]*>
They both falsely match said tags if commented out (which is why I said not possible).
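To make that false match concrete, here is a small sketch that runs the div pattern and an XPath query over the same input. The sample markup and the variable names are invented for this example; the pattern is the one above, wrapped in ~ delimiters for preg_match.

```php
<?php
// The div pattern from above; quotes are escaped for a single-quoted PHP string.
$pattern = '~<div\s+[^>]*style\s*=\s*["\'][^"\']*?(width|background-image)[^"\']*?["\'][^>]*>~';

// Hypothetical input: the only styled div sits inside an HTML comment.
$html = '<body><!-- <div style="width:100px">old</div> --><p>live content</p></body>';

// The regex sees raw characters and has no idea what a comment is.
$regexHit = preg_match($pattern, $html) === 1;

// The parser turns the comment into a comment node, so no div element exists.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);
$parserHit = $xpath->query("//div[contains(@style,'width:')]")->length > 0;

var_dump($regexHit);
var_dump($parserHit);
```

With this input the regex should report a match while the XPath query finds nothing, which is exactly the difference between matching text and parsing a document.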
You can use Tidy to clean up the HTML, then parse it as XML. Then it's easy to use xpath to find nodes. Try something like this:
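$tidyConfig = array(
    "add-xml-decl"     => true,  // emit an XML declaration
    "output-xml"       => true,  // write well-formed XML
    "numeric-entities" => true   // turn named entities into numeric ones that XML parsers accept
);
$tidy = new tidy();
$tidy->parseString($html, $tidyConfig, "utf8");
$tidy->cleanRepair();
// Cast the repaired document to a string before handing it to SimpleXML.
$xml = new SimpleXMLElement((string) $tidy);
// '//style' searches the whole document; a bare 'style' only matches direct children of the root.
$matches = $xml->xpath('//style');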
As for parsing a style attribute to look for specific selectors, I think you'll have to do that manually. You could use a CSS parser if you want.
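If you go the manual route, a rough sketch could look like the following. parseStyleAttribute() is a hypothetical helper written for this example, not part of any library, and it assumes that values never contain a semicolon (a data: URL or a quoted string with ";" in it would break it, which is where a real CSS parser earns its keep).

```php
<?php
/**
 * Naive split of a style attribute into a property => value map.
 * Assumption: no semicolons inside values (e.g. in url(data:...) or quoted strings).
 */
function parseStyleAttribute($style)
{
    $declarations = array();
    foreach (explode(';', $style) as $chunk) {
        // Split on the first colon only, so "url(http://...)" keeps its own colon.
        $parts = explode(':', $chunk, 2);
        if (count($parts) !== 2) {
            continue; // empty chunk or junk without a colon
        }
        $property = strtolower(trim($parts[0]));
        if ($property !== '') {
            $declarations[$property] = trim($parts[1]);
        }
    }
    return $declarations;
}

$parsed = parseStyleAttribute('Width: 100px; background-image: url(http://example.com/a.png);');
print_r($parsed);
```

After that, checking for a property is just isset($parsed['width']) or isset($parsed['background-image']).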
It's rarely a good idea to parse HTML with regex. However, any good HTML parser will be able to find all the divs with style attributes, and regex could be useful for parsing the style attributes once you've done that.
It's still possible for complex (yet valid) CSS to break most regex, however, so the really durable thing here would be an HTML parser combined with a CSS parser. That could be overkill, though; a regex like \bwidth\s*:\s*(\w+) is likely to catch any width value unless someone's actively trying to fool it.
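A minimal sketch of that split, assuming PHP's DOMDocument as the HTML parser: the parser finds the divs that carry a style attribute, and the regex above is applied only to the attribute value. The sample markup and the $widths array are made up for illustration.

```php
<?php
// Hypothetical input: one div with a width, one without, one with no style at all.
$html = '<div style="color:red; width : 250px">a</div><div style="height:1em">b</div><div>c</div>';

libxml_use_internal_errors(true);   // tolerate sloppy markup quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$widths = array();

// Let the parser locate the elements; use the regex only on the attribute text.
foreach ($xpath->query('//div[@style]') as $div) {
    if (preg_match('/\bwidth\s*:\s*(\w+)/', $div->getAttribute('style'), $m)) {
        $widths[] = $m[1];
    }
}

print_r($widths);
```

Two caveats with the pattern as written: \b also sits right after a hyphen, so max-width or border-width will match too, and (\w+) stops at the first non-word character, so "12.5em" is captured as "12" and "50%" as "50".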
A good HTML parser won't choke on anything that wouldn't choke a browser. I'm not a PHP guy anymore, but I've heard some good things about HTML Purifier.