Regex: finding tags between specific tags

I have some HTML code that contains many hrefs, but I don't need all of them. I only want the hrefs contained in this div:

<div class="category-map second-links"> 
*****
</div> <p class="sec">

What I want to see as a result:

<a href='xxx'>yyy</a>
<a href='zzz'>www</a>
...

My version (not working):

(?<=<div class=\"category-map second-links\">)(.+?(<a href=\".+?".+?>.+<\/a>))+(?=<\/div> <p class="sec">)

If you load your HTML into a DOM document, you can use XPath to query nodes from it.

All a elements inside the document:

  • //a

That have an ancestor div element (the ancestor axis includes the parent):

  • //a[ancestor::div]

With the class attribute category-map second-links

  • //a[ancestor::div[@class = "category-map second-links"]]

Optionally, get the href attributes of the filtered a elements:

  • //a[ancestor::div[@class = "category-map second-links"]]/@href

Full Example:

$html = <<<'HTML'
<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href='xxx'>yyy</a>
        <a href='zzz'>www</a>
...
    </div>
<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href='aaa'>bbb</a>
        <a href='ccc'>ddd</a>
...
    </div>
</div> <p class="sec">
HTML;

$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);

// fetch the href attributes
$hrefs = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]/@href') as $node) {
  $hrefs[] = $node->value;
}
var_dump($hrefs);

// fetch the a elements and read some data from them
$linkData = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]') as $node) {
  $linkData[] = array(
    'href' => $node->getAttribute('href'),
    'text' => $node->nodeValue,
  );
}
var_dump($linkData);

// fetch the a elements and store their html
$links = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]') as $node) {
  $links[] = $dom->saveHtml($node);
}
var_dump($links);
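
The expressions above compare the class attribute as one exact string, so they stop matching if the classes appear in a different order or with an extra class. A minimal sketch of a token-based match, assuming made-up sample markup (the extra class and the second div are hypothetical, not from the question):

```php
<?php
// Hypothetical sample: classes in a different order, with an extra token and odd spacing.
$html = '<div class="second-links  category-map extra"><a href="xxx">yyy</a></div>'
      . '<div class="other"><a href="nope">skip</a></div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the normalized class value with spaces so each class token is matched whole.
$query = '//a[ancestor::div['
       . 'contains(concat(" ", normalize-space(@class), " "), " category-map ") and '
       . 'contains(concat(" ", normalize-space(@class), " "), " second-links ")'
       . ']]/@href';

$hrefs = array();
foreach ($xpath->evaluate($query) as $node) {
    $hrefs[] = $node->value;
}
var_dump($hrefs);
```

Whether this looser match is wanted depends on how stable the class attribute is in the real page. The exact comparison is stricter and simpler.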

Use Simple HTML DOM (the PHP Simple HTML DOM Parser library):

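// Load the Simple HTML DOM library (file path is an assumption; adjust to your setup)
include 'simple_html_dom.php';

// Create DOM from URL
$html = file_get_html('<YOUR_WEBSITE_URL_HERE>');

// Find specific tag
$anchors = array();
foreach($html->find('div.category-map.second-links a') as $anchor) {
    $anchors[] = $anchor;
}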

print_r($anchors);

If you want to use regex, you will probably need two queries: one to get all the divs, and a second to find the hrefs inside each div.

That is because a single query like this

"<div.*?<a href='(?<data>.*?)'.*?</div>"

returns only one href per div, even when a div contains more than one.
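
A rough sketch of that two-pass idea, assuming flat (non-nested) divs and single-quoted href values. The sample markup and the variable names are made up for illustration:

```php
<?php
// Hypothetical sample: two flat divs, the first one containing two links.
$html = "<div class='a'><a href='one'>1</a> <a href='two'>2</a></div>"
      . "<div class='b'><a href='three'>3</a></div>";

// Single pattern, as above: each div yields at most one href.
preg_match_all("~<div.*?<a href='(?<data>.*?)'.*?</div>~s", $html, $single);

// Pass 1: capture the inner HTML of every div (assumes divs are not nested).
preg_match_all('~<div\b[^>]*>(.*?)</div>~s', $html, $divs);

// Pass 2: collect every href inside each captured div.
$hrefs = array();
foreach ($divs[1] as $inner) {
    if (preg_match_all("~<a\s[^>]*href='([^']*)'~i", $inner, $m)) {
        $hrefs = array_merge($hrefs, $m[1]);
    }
}

print_r($single['data']); // hrefs found by the single pattern
print_r($hrefs);          // hrefs found by the two passes
```

With this sample, the single pattern misses the second link of the first div, while the two passes pick up all three.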

Alternatively, you can do this with a DOM library:

$dom->find('div a')->attrib('href');

I am not sure the DOM snippet above works 100% as written, but I offer it as a hint; I hope you can adapt it into something that works for you.

Disclaimer: you're better off using a proper HTML parser. This answer is for educational purposes, although it's more reliable than your common regex if the HTML is valid :P

Regex is awesome

So I decided to do this in two parts:

  • Match everything that's in <div class="category-map second-links"></div> even if it's nested.
  • Loop through these matches and match <a></a>. I chose to keep it simple since I don't expect links to be nested.

The hard part

So here's the regex. We'll be using a recursive pattern and the xsi modifiers:

<div\s+class\s*=\s*"\s*category-map\s+second-links\s*"\s*>    # match the div with these two classes
(?:                                                           # non-capturing group
   (?:<!--.*?-->)?                                            # Match the comments !
   (?:(?!</?div[^>]*>).)                                      # check if there is no start/closing tag
   |                                                          # or (which means there is)
   (?R)                                                       # Recurse the pattern, it's the same as (?0)
)*                                                            # repeat zero or more times
</div\s*>                                                     # match the closing tag
(?=.*?<p\s+class\s*=\s*"\s*sec\s*"\s*>)                       # make sure there is <p class="sec"> ahead of the expression

Modifiers:

  • s : makes the dot metacharacter in the pattern match all characters, including newlines.
  • x : whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns.
  • i : makes the match case-insensitive.
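
To see what these modifiers change in practice, here is a small sketch on a made-up input with an uppercase tag and a newline inside the link text. The sample string and variable names are assumptions for illustration only:

```php
<?php
// Hypothetical input: uppercase tag name, and link text that spans two lines.
$html = "<A HREF='xxx'>first\nsecond</A>";

// Without modifiers: "a" does not match "A", and "." stops at the newline.
$plain = preg_match('~<a[^>]*>.*?</a>~', $html);

// i: case-insensitive; s: dot also matches newlines;
// x: layout whitespace and # comments inside the pattern are ignored.
$extended = preg_match('~
    <a[^>]*>   # opening tag
    .*?        # content, possibly spanning lines
    </a\s*>    # closing tag
~six', $html, $m);

var_dump($plain, $extended, $m[0]);
```

The first call finds nothing, while the second matches the whole string.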

The easy part

Matching unnested a tags isn't that difficult if there isn't some crazy stuff like <a title="</a>"></a>:

<a[^>]*>    # match the beginning a tag
.*?         # match everything ungreedy until ...
</a\s*>     # match </a       > or </a>
# Not forgetting the xsi modifiers

Wrapping everything up in PHP

$input = '<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href=\'xxx\'>yyy</a>
        <a href=\'zzz\'>www</a>
...
    </div>
<div class="category-map second-links"> 
*****
    <!--<div class="category-map second-links"> Comment hacks --> 
    <div class="category-map second-links">
        <a href=\'aaa\'>bbb</a>
        <a href=\'ccc\'>ddd</a>
...
    </div>
</div> <p class="sec">';

$links = array();

preg_match_all('~
<div\s+class\s*=\s*"\s*category-map\s+second-links\s*"\s*>    # match the div with these two classes
(?:                                                           # non-capturing group
   (?:<!--.*?-->)?                                            # Match the comments !
   (?:(?!</?div[^>]*>).)                                      # check if there is no start/closing tag
   |                                                          # or (which means there is)
   (?R)                                                       # Recurse the pattern, it\'s the same as (?0)
)*                                                            # repeat zero or more times
</div\s*>                                                     # match the closing tag
(?=.*?<p\s+class\s*=\s*"\s*sec\s*"\s*>)                       # make sure there is <p class="sec"> ahead of the expression
~sxi', $input, $matches);

if(isset($matches[0])){
    foreach($matches[0] as $match){
        preg_match_all('~
                            <a[^>]*>    # match the beginning a tag
                            .*?         # match everything ungreedy until ...
                            </a\s*>     # match </a       > or </a>
                        ~isx', $match, $tempLinks);
        if(!empty($tempLinks[0])){
            // merge so $links stays a flat list of links from every matched div
            $links = array_merge($links, $tempLinks[0]);
        }
    }
}

if(!empty($links)){
    print_r($links);
}else{
    echo 'empty :(';
}
