How do I scrape HTML with a regular expression and ignore whitespace and newlines in the code?

I'm putting together a quick script to scrape a page for some results and I'm having trouble figuring out how to ignore white space and new lines in my regex.

For example, here's how the page may present a result in HTML:

<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>

How would I change the following regex to ignore the spaces and new lines:

$regex = '/<td class="things"><div class="stuff"><p>(.*)<\/p><\/div><\/td>/i';

Any help would be appreciated. Help that also explains why you did something would be greatly appreciated!

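Be warned that you're playing with fire by using regex on HTML. That said, to answer your question: put \s* between the tags so that any run of spaces, tabs, or newlines is allowed there, and add the s modifier so that . also matches newlines inside the captured text. Using # as the delimiter means the slashes in the closing tags no longer need escaping: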

$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';
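As a rough illustration of how a pattern of this shape behaves on the markup from the question, here is a minimal sketch. It assumes a lazy `(.*?)` capture and no `^` anchor, so that the pattern can match anywhere inside a larger page and stops at the first `</p>`; the variable names are illustrative only.

```php
<?php
// Sample markup from the question, with its indentation and newlines intact.
$html = <<<EOF
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
EOF;

// \s*   : zero or more whitespace characters (spaces, tabs, newlines) between tags
// (.*?) : lazy capture, stops at the first </p> instead of the last one
// s     : lets . match newlines, so the captured text may span several lines
// i     : case-insensitive matching
// #     : delimiter, so the / in closing tags needs no escaping
$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

$captured = null;
if (preg_match($regex, $html, $matches)) {
    // $matches[0] is the whole match, $matches[1] is the first capture group.
    $captured = $matches[1];
    echo $captured, "\n"; // prints: I need to capture this text.
}
```

To collect every result on a page rather than only the first, preg_match_all() accepts the same pattern.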

Update: Here is the DOM Parser based code to get what you want:

$html = <<< EOF
<td class="things">
    <div class="stuff">
        <p>I need to capture this text.</p>
    </div>
</td>
EOF;
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
$nodelist = $xpath->query("//td[@class='things']/div[@class='stuff']/p");
for($i=0; $i < $nodelist->length; $i++) {
    $node = $nodelist->item($i);
    $val = $node->nodeValue;
    echo "$val\n"; // prints: I need to capture this text.
}

And now please refrain from parsing HTML using regex in your code.

SimpleHTMLDomParser will let you grab the content of a selected div, or the contents of elements such as <p>, <h1>, <img>, etc.

That might be a quicker way to achieve what you're trying to do.

The solution is to not use regular expressions on HTML. See this great article on the subject: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html

Bottom line is that HTML is not a regular language, so regular expressions are not a good fit. You have variations in white space, potentially unclosed tags (who is to say the HTML you are scraping is going to always be correct?), among other challenges.
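To make the point concrete, here is a small sketch (not taken from the article above) that runs a whitespace-tolerant regex and DOMDocument over two renderings of the same kind of result. The second rendering leaves its <p> unclosed, which is an assumed example of the sloppy markup a scraper may meet; the variable names are illustrative.

```php
<?php
// Two renderings of a result; the second one never closes its <p> tag.
$variants = [
    '<td class="things"><div class="stuff"><p>First</p></div></td>',
    '<td class="things"><div class="stuff"><p>Second</div></td>',
];

$regex = '#<td class="things">\s*<div class="stuff">\s*<p>(.*?)</p>\s*</div>\s*</td>#si';

$regexHits = [];
$domHits = [];

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

foreach ($variants as $html) {
    // The regex needs a literal </p>, so the second variant does not match.
    if (preg_match($regex, $html, $m)) {
        $regexHits[] = $m[1];
    }

    // The parser closes the dangling <p> itself when it reaches </div>.
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    foreach ($doc->getElementsByTagName('p') as $p) {
        $domHits[] = $p->textContent;
    }
}
libxml_clear_errors();

print_r($regexHits); // contains only "First"
print_r($domHits);   // contains "First" and "Second"
```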

Instead, use PHP's DOMDocument, impress your friends, AND do it the right way every time:

    // create a new DOMDocument
    $doc = new DOMDocument();

    // load the string into the DOM
    $doc->loadHTML('<td class="things"><div class="stuff"><p>I need to capture this text.</p></div></td>');

    // since we are working with HTML fragments here, remove <!DOCTYPE 
    $doc->removeChild($doc->firstChild);            

    // likewise remove <html><body></body></html> 
    $doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);

    $contents = array();
    //Loop through each <p> tag in the dom and grab the contents
    // if you need to use selectors or get more complex here, consult the documentation
    foreach($doc->getElementsByTagName('p') as $paragraph) {
        $contents[] = $paragraph->textContent;
    } 

    print_r($contents);
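If grabbing every <p> is too broad, one option is an XPath query through DOMXPath. The sketch below is an assumption about what "more complex" might look like, not part of the code above: it matches class names as whole tokens, so a cell written as class="things wide" still counts, while a neighbouring cell with a different class is skipped. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical page fragment: two cells, only the first has the "things" class.
$html = '<table><tr>'
      . '<td class="things wide"><div class="stuff"><p>I need to capture this text.</p></div></td>'
      . '<td class="other"><div class="stuff"><p>Not this one.</p></div></td>'
      . '</tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// concat(' ', normalize-space(@class), ' ') pads the class list with spaces,
// so contains(..., ' things ') matches the token "things" but not "somethings".
$query = "//td[contains(concat(' ', normalize-space(@class), ' '), ' things ')]"
       . "//div[contains(concat(' ', normalize-space(@class), ' '), ' stuff ')]/p";

$contents = [];
foreach ($xpath->query($query) as $paragraph) {
    $contents[] = trim($paragraph->textContent);
}

print_r($contents); // contains only "I need to capture this text."
```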

Documentation

This PHP extension is regarded as "standard", and is usually already installed on most web servers -- no third-party scripts or libraries required. Enjoy!