Parsing web page source code with regular expressions

I can't seem to figure out the regular expression I need in order to parse the following.

<div id="MustBeInThisId">
   <div class="ValueFromThisClass">
      The Value I need
   </div>
</div>

As you can see, I have a wrapping div with an id. That div contains multiple other divs, but I only need the value from one of them.

If you are trying to extract some data from an HTML document, you should not use regular expressions.

Instead, you should use a DOM Parser : those are made exactly for that.


In PHP, you would use the DOMDocument class, and its DOMDocument::loadHTML() method, to load the HTML content.


Then, you can work with methods such as DOMDocument::getElementById() and DOMDocument::getElementsByTagName().

You can even work with DOMXPath to execute XPath queries on your HTML content -- which will allow you to search for pretty much anything in it.


In your case, I suppose that something like this should do the trick.

First, get your HTML content into a string (or use DOMDocument::loadHTMLFile()) :

$html = <<<HTML
<p>hello</p>
<div>
    <div id="MustBeInThisId">
        <div class="ValueFromThisClass">
            The Value I need
        </div>
    </div>
</div>
HTML;

Then, load it to a DOMDocument instance :

$dom = new DOMDocument();
$dom->loadHTML($html);

Instantiate a DOMXPath object, and use it to query your DOM object.
My XPath expression might be a bit more complex than necessary; I'm not really good with those.

$xpath = new DOMXPath($dom);
$items = $xpath->query('//div[@id="MustBeInThisId"]/div[@class="ValueFromThisClass"]');

And, finally, work with the results of that query :

if ($items->length > 0) {
    var_dump( trim( $items->item(0)->nodeValue ) );
}

And here is your result :

string 'The Value I need' (length=16)
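
If the real page is not perfectly valid HTML, or the inner div carries several class names, the steps above could be wrapped up roughly like this. It is only a sketch: the function name extractValue() and its parameters are made up for illustration, and it assumes the id and class you pass in contain no double quotes.

```php
<?php
// Hypothetical helper: extractValue() and its parameter names are not part
// of any library, they only package the steps shown above.
function extractValue(string $html, string $id, string $class): ?string
{
    // Collect libxml parse warnings internally instead of emitting them,
    // since real-world pages are rarely valid HTML.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // Match the class as a whole word inside the class attribute, so that
    // class="foo ValueFromThisClass" is found as well.
    // Assumption: $id and $class contain no double quotes.
    $query = sprintf(
        '//div[@id="%s"]//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $id,
        $class
    );

    $items = $xpath->query($query);
    if ($items === false || $items->length === 0) {
        return null; // nothing matched
    }

    return trim($items->item(0)->nodeValue);
}
```

Called as extractValue($html, 'MustBeInThisId', 'ValueFromThisClass'), it returns the trimmed text of the first matching div, or null when nothing matches.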

Use something like simplehtmldom - it will make your life much, much easier.

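// Load the Simple HTML DOM library (adjust the path to where you saved it)
include 'simple_html_dom.php';

$html = str_get_html($source_code);
$tag = $html->find("#MustBeInThisId .ValueFromThisClass", 0);
$the_value_i_need = $tag->innertext;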

Regex can't parse HTML since HTML isn't a regular language. You should use DOMDocument.

Then you get nice functions like getElementById :)
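
For instance, something along these lines might work. It is a rough sketch: findValueById() is a name invented here, and it relies on the HTML being loaded with loadHTML(), which is what lets getElementById() recognise the id attribute.

```php
<?php
// Hypothetical helper: findValueById() is an illustrative name, not a
// built-in function.
function findValueById(string $html, string $id, string $class): ?string
{
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Look up the wrapping element by its id.
    $wrapper = $dom->getElementById($id);
    if ($wrapper === null) {
        return null;
    }

    // Walk the divs nested inside the wrapper and pick the first one whose
    // class attribute lists the wanted class name.
    foreach ($wrapper->getElementsByTagName('div') as $div) {
        $classes = explode(' ', $div->getAttribute('class'));
        if (in_array($class, $classes, true)) {
            return trim($div->textContent);
        }
    }

    return null;
}
```

Here findValueById($html, 'MustBeInThisId', 'ValueFromThisClass') would return the trimmed text of the inner div, or null if either the id or the class is not found.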

Or try a JavaScript library like jQuery. I think it's the easiest way to do what you want.