I have an HTML document that contains many hrefs, but I don't need all of them. I only want the hrefs contained in this div:
<div class="category-map second-links">
*****
</div> <p class="sec">
What I want to see as a result:
<a href='xxx'>yyy</a>
<a href='zzz'>www</a>
...
My version (not working):
(?<=<div class=\"category-map second-links\">)(.+?(<a href=\".+?".+?>.+<\/a>))+(?=<\/div> <p class="sec">)
If you load your HTML into a DOM document, you can use Xpath to query nodes from it.
All a elements inside the document:
//a
That have an ancestor div element:
//a[ancestor::div]
With the class attribute category-map second-links:
//a[ancestor::div[@class = "category-map second-links"]]
Optionally, get the href attributes of the filtered a elements:
//a[ancestor::div[@class = "category-map second-links"]]/@href
Full Example:
$html = <<<'HTML'
<div class="category-map second-links">
*****
<!--<div class="category-map second-links"> Comment hacks -->
<div class="category-map second-links">
<a href='xxx'>yyy</a>
<a href='zzz'>www</a>
...
</div>
<div class="category-map second-links">
*****
<!--<div class="category-map second-links"> Comment hacks -->
<div class="category-map second-links">
<a href='aaa'>bbb</a>
<a href='ccc'>ddd</a>
...
</div>
</div> <p class="sec">
HTML;
$dom = new DOMDocument();
$dom->loadHtml($html);
$xpath = new DOMXpath($dom);
// fetch the href attributes
$hrefs = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]/@href') as $node) {
$hrefs[] = $node->value;
}
var_dump($hrefs);
// fetch the a elements and read some data from them
$linkData = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]') as $node) {
$linkData[] = array(
'href' => $node->getAttribute('href'), // attribute name without the XPath "@" prefix
'text' => $node->nodeValue,
);
}
var_dump($linkData);
// fetch the a elements and store their html
$links = array();
foreach ($xpath->evaluate('//a[ancestor::div[@class = "category-map second-links"]]') as $node) {
$links[] = $dom->saveHtml($node);
}
var_dump($links);
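One caveat with @class = "category-map second-links": it compares the whole attribute string, so it misses a div whose classes appear in a different order or that carries an extra class. A minimal sketch of a token-based match follows, assuming the sample markup below (the extra "highlight" class and the second div are made up for illustration):

```php
<?php
// Hypothetical sample markup: classes reordered, extra whitespace, and an extra class.
$html = <<<'HTML'
<div class="second-links  category-map highlight">
<a href="xxx">yyy</a>
<a href="zzz">www</a>
</div>
<div class="other"><a href="nope">skip me</a></div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the normalized class attribute with spaces and test each class as a whole token.
$expression = '//a[ancestor::div[
    contains(concat(" ", normalize-space(@class), " "), " category-map ") and
    contains(concat(" ", normalize-space(@class), " "), " second-links ")
]]/@href';

$hrefs = array();
foreach ($xpath->evaluate($expression) as $attribute) {
    $hrefs[] = $attribute->value;
}
var_dump($hrefs);
```

If the class attribute is always written exactly as in the question, the simpler equality test is enough.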
Use the PHP Simple HTML DOM Parser (simple_html_dom):
// Create DOM from URL
$html = file_get_html('<YOUR_WEBSITE_URL_HERE>');
// Find the anchors inside the target div
$anchors = array();
foreach($html->find('div.category-map.second-links a') as $anchor) {
$anchors[] = $anchor;
}
print_r($anchors);
If you want to use regex, you will probably need two regex queries: one to get all the divs, and a second to find the hrefs inside each div.
That is because a single query like this
"<div.*?<a href='(?<data>.*?)'.*?</div>"
returns only one href per div, even if a div contains more than one.
Alternatively, you can do this using a DOM library:
$dom->find('div a')->attrib('href');
I am not sure the DOM snippet above works 100% as written. Take it as a hint and adapt it to whatever library you use.
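The two-query regex idea can be sketched roughly like this. It is only an illustration: the sample input is made up, and it assumes sibling (non-nested) divs and single-quoted hrefs as in the question.

```php
<?php
// Hypothetical input: two sibling divs, no nesting, single-quoted hrefs.
$html = <<<'HTML'
<div class="category-map second-links">
<a href='xxx'>yyy</a>
<a href='zzz'>www</a>
</div>
<div class="category-map second-links">
<a href='aaa'>bbb</a>
</div>
HTML;

$hrefs = array();

// Query 1: capture the body of every matching div (lazy, so it stops at the first </div>).
preg_match_all('~<div class="category-map second-links">(.*?)</div>~s', $html, $divs);

foreach ($divs[1] as $body) {
    // Query 2: collect every href inside this div body.
    preg_match_all('~<a\s+href=\'(?<data>[^\']*)\'~', $body, $anchors);
    $hrefs = array_merge($hrefs, $anchors['data']);
}

print_r($hrefs);
```

Because the first query stops at the first closing div tag, this breaks as soon as the divs are nested. That case needs a recursive pattern or a real parser.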
Disclaimer: you're better off using a proper HTML parser. This answer is for educational purposes, although it's considerably more reliable than your common regex, provided the HTML is valid :P
So I decided to do this in two parts:
1. Match the whole <div class="category-map second-links"></div>, even if it's nested.
2. Match the <a></a> tags. I chose to keep it simple since I don't expect links to be nested.
So here's the regex. We'll be using a recursive pattern and the xsi modifiers:
<div\s+class\s*=\s*"\s*category-map\s+second-links\s*"\s*> # match a certain div with a certain classes
(?: # non-capturing group
(?:<!--.*?-->)? # Match the comments !
(?:(?!</?div[^>]*>).) # check if there is no start/closing tag
| # or (which means there is)
(?R) # Recurse the pattern, it's the same as (?0)
)* # repeat zero or more times
</div\s*> # match the closing tag
(?=.*?<p\s+class\s*=\s*"\s*sec\s*"\s*>) # make sure there is <p class="sec"> ahead of the expression
Modifiers:
s: makes the dot metacharacter in the pattern match all characters, including newlines.
x: whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns.
i: match case-insensitively.
Matching unnested a tags isn't that difficult if there isn't some crazy stuff like <a title="</a>"></a>:
<a[^>]*> # match the beginning a tag
.*? # match everything ungreedy until ...
</a\s*> # match </a > or </a>
# Not forgetting the xsi modifiers
$input = '<div class="category-map second-links">
*****
<!--<div class="category-map second-links"> Comment hacks -->
<div class="category-map second-links">
<a href=\'xxx\'>yyy</a>
<a href=\'zzz\'>www</a>
...
</div>
<div class="category-map second-links">
*****
<!--<div class="category-map second-links"> Comment hacks -->
<div class="category-map second-links">
<a href=\'aaa\'>bbb</a>
<a href=\'ccc\'>ddd</a>
...
</div>
</div> <p class="sec">';
$links = array();
preg_match_all('~
<div\s+class\s*=\s*"\s*category-map\s+second-links\s*"\s*> # match a certain div with a certain classes
(?: # non-capturing group
(?:<!--.*?-->)? # Match the comments !
(?:(?!</?div[^>]*>).) # check if there is no start/closing tag
| # or (which means there is)
(?R) # Recurse the pattern, it\'s the same as (?0)
)* # repeat zero or more times
</div\s*> # match the closing tag
(?=.*?<p\s+class\s*=\s*"\s*sec\s*"\s*>) # make sure there is <p class="sec"> ahead of the expression
~sxi', $input, $matches);
if(isset($matches[0])){
foreach($matches[0] as $match){
preg_match_all('~
<a[^>]*> # match the beginning a tag
.*? # match everything ungreedy until ...
</a\s*> # match </a > or </a>
~isx', $match, $tempLinks);
if(!empty($tempLinks[0])){
// merge the links of every matched div into one flat list
$links = array_merge($links, $tempLinks[0]);
}
}
}
if(!empty($links)){
print_r($links);
}else{
echo 'empty :(';
}