Parsing HTML - are regular expressions the only option in this case?

A user will supply HTML, which may be valid or invalid (malformed). I need to be able to determine things such as:

1. Is there a style tag in the body?
2. Is there a div with a style attribute that makes use of width or background-image?

I have tried using the DOMDocument class, but with XPath I could only get it to do 1 and not 2.

I have also tried simple_html_dom and that can only do 1 but not 2.

Do you think it's a good idea to just use regular expressions, or is there something I haven't thought of?

XPath can do both (1) and (2):

To test if there's a style tag in the body:

//body//style

To test if there's a div with a style attribute using width or background-image:

//div[contains(@style,'width:') or contains(@style,'background-image:')]

And, as you were curious about in your comments, seeing if a style tag contains a:hover or font-size:

//style[contains(text(),'a:hover') or contains(text(),'font-size:')]
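For what it's worth, here is a minimal sketch of running those three queries with DOMDocument and DOMXPath. The sample markup and the labels in $queries are made up for illustration, and the input is deliberately malformed to show that loadHTML() still builds a tree from it:

```php
<?php
// Hypothetical sample input; deliberately malformed (unclosed <p>, no </html>).
$html = <<<'HTML'
<html><head><title>t</title></head>
<body>
<style>a:hover { color: red; } p { font-size: 12px; }</style>
<div style="width: 100px; color: blue">box<p>unclosed
</body>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// The three XPath expressions from this answer, keyed by an illustrative label.
$queries = [
    'style tag in body' => "//body//style",
    'div with width/background-image' => "//div[contains(@style,'width:') or contains(@style,'background-image:')]",
    'style with a:hover/font-size' => "//style[contains(text(),'a:hover') or contains(text(),'font-size:')]",
];

$results = [];
foreach ($queries as $label => $query) {
    // DOMXPath::query() returns a DOMNodeList; its length is the match count.
    $results[$label] = $xpath->query($query)->length;
    echo $label, ': ', $results[$label], "\n";
}
```

A non-zero count means the condition holds. Note that contains(@style,'width:') is a plain substring test, so it will miss "width :" with a space before the colon and will also hit "max-width:".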

Regex is NEVER (again: NEVER!) a solution for parsing HTML!

Regex can be used for Type-3 Chomsky languages (regular languages).
HTML, however, is a Type-2 Chomsky language (a context-free language).

If still in doubt: http://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy

To work safely with a Type-2 language you need a context-free language parser. You might want to try an LL parser or a recursive descent parser, e.g.

That being said:

Match body with style:

<body\s+[^>]*style\s*=\s*["'].*?[^"']*?["'][^>]*>

Match div with width|background-image in style:

<div\s+[^>]*style\s*=\s*["'][^"']*?(width|background-image)[^"']*?["'][^>]*>

They both falsely match said tags if commented out (which is why I said not possible).
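A small sketch of that false positive, using the div pattern above with preg_match. The two input strings are made-up examples, and the ~ delimiters are just my choice for wrapping the pattern:

```php
<?php
// The div pattern from above, wrapped in ~ delimiters for preg_match.
$divPattern = '~<div\s+[^>]*style\s*=\s*["\'][^"\']*?(width|background-image)[^"\']*?["\'][^>]*>~';

// Hypothetical inputs: one live div, and the same div inside an HTML comment.
$live      = '<body><div style="width: 10px">x</div></body>';
$commented = '<body><!-- <div style="width: 10px">x</div> --></body>';

$liveMatch      = preg_match($divPattern, $live);
$commentedMatch = preg_match($divPattern, $commented);

var_dump($liveMatch === 1);      // the live div is matched
var_dump($commentedMatch === 1); // the commented-out div is matched too
```

The regex has no notion of being inside a comment, so both calls report a match. A DOM parser would treat the second div as comment text and not as an element.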

You can use Tidy to clean up the HTML, then parse it as XML. Then it's easy to use xpath to find nodes. Try something like this:

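// Ask Tidy to repair the markup and emit well-formed XML.
$tidyConfig = array(
    "add-xml-decl" => true,
    "output-xml" => true,
    "numeric-entities" => true
);
$tidy = new tidy();
$tidy->parseString($html, $tidyConfig, "utf8");
$tidy->cleanRepair();
// Cast to string to get the repaired markup, then load it as XML.
$xml = new SimpleXMLElement((string) $tidy);
// '//style' matches style elements at any depth; a bare 'style'
// would only match direct children of the root element.
$matches = $xml->xpath('//style');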

As for parsing a style attribute to look for specific selectors, I think you'll have to do that manually. You could use a CSS parser if you want.

It's rarely a good idea to parse HTML with regex. However, any good HTML parser will be able to find all the divs with style attributes, and regex could be useful for parsing the style attribute values once you've done that.

It's still possible for complex (yet valid) CSS to break most regex, however, so the really durable thing here would be an HTML parser combined with a CSS parser. That could be overkill, though; a regex like \bwidth\s*:\s*(\w+) is likely to catch any width value unless someone's actively trying to fool it.
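Here is a rough sketch of that split, assuming PHP's DOMDocument as the HTML parser: the parser finds the divs, and the regex only ever sees the style attribute value. The sample markup and the $widths variable are made up for illustration:

```php
<?php
// Hypothetical sample input.
$html = '<body><div style="color: red; width : 50%">a</div>'
      . '<div style="max-width:200px">b</div><div>c</div></body>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$widths = [];

// Let the HTML parser find the divs that carry a style attribute ...
foreach ($xpath->query('//div[@style]') as $div) {
    // ... and apply the regex to the attribute value only.
    if (preg_match('/\bwidth\s*:\s*(\w+)/', $div->getAttribute('style'), $m)) {
        $widths[] = $m[1];
    }
}

print_r($widths);
```

Two limits of the simple pattern show up with this input: (\w+) stops at the % sign, so "50%" is captured as "50", and \b also matches after a hyphen, so "max-width" counts as a width. If that matters, the pattern needs tightening or a CSS parser takes over.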

Edit:

A good HTML parser won't choke on anything that wouldn't choke a browser. I'm not a PHP guy anymore, but I've heard some good things about HTML Purifier.