If I have a webpage like this:
<body>
<header>
<a href='http://domain1.com'>link 1 text</a>
</header>
<a href='http://domain2.com'>link 2 text</a>
<footer>
<a href='http://domain3.com'>link 3 text</a>
</footer>
</body>
How do I pull the <a> tags out of the <body> but exclude the links from <header> and <footer>?
In the real web page, there will be a lot of <a> tags in the <header>, so I'd rather not have to cycle through ALL of them.
I want to pull out the URLs and anchor text from each of the <a> tags that are NOT inside the <header> or <footer> tags.
EDIT: this is how I find links in the header:
$header = $html->find('header', 0);
foreach ($header->find('a') as $a) {
    // do something
}
I would like to do something like this (note the use of "!", which is not real selector syntax):
$foo = $html->find('!header,!footer');
foreach ($foo->find('a') as $a) {
    // do something
}
Remove the header and footer from the DOM you are working with before looking for the links.
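<?php
include("simple_html_dom.php");

// Sample markup: two links in the body, one each in the header and footer.
$source = <<<EOD
<body>
<header>
<a href='http://domain1.com'>link 1 text</a>
</header>
<a href='http://domain2.com'>link 2 text</a>
<a href='http://domain4.com'>link 4 text</a>
<footer>
<a href='http://domain3.com'>link 3 text</a>
</footer>
</body>
EOD;

$html = str_get_html($source);

// Blank out the header and footer, including everything inside them.
foreach ($html->find('header, footer') as $unwanted) {
    $unwanted->outertext = "";
}

// Re-parse the modified markup so the removed nodes are really gone.
$html->load($html->save());

// Only the links outside the header and footer remain.
$links = $html->find("a");
foreach ($links as $link) {
    // $link->href and $link->plaintext hold the URL and the anchor text.
    print $link;
}
?>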
This isn't possible with simple-html-dom, at least not in a simple way.
$html->find('body > a');
This CSS selector selects all <a> elements whose parent is the <body> element, so it misses links nested deeper in the page.
To get those as well, you need to loop through the body's child nodes and then collect the <a> elements from each one.
I suggest looking at "How do you parse and process HTML/XML in PHP?"
For my part, I'm using Symfony/DomCrawler and Symfony/CssSelector to do this.
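If switching parsers is an option, here is a rough sketch of the same idea using PHP's built-in DOM extension rather than Symfony or simple-html-dom. This is my own illustration, not taken from the linked question: the XPath expression and the `$links` array shape are assumptions chosen for this example, and the markup is the sample from the question.

```php
<?php
// Sketch only: uses PHP's bundled DOMDocument/DOMXPath, not simple-html-dom.
$source = <<<EOD
<body>
<header>
<a href='http://domain1.com'>link 1 text</a>
</header>
<a href='http://domain2.com'>link 2 text</a>
<footer>
<a href='http://domain3.com'>link 3 text</a>
</footer>
</body>
EOD;

// libxml's HTML parser predates HTML5 and reports <header>/<footer> as
// invalid tags, so collect those warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select every <a> that has neither a <header> nor a <footer> ancestor.
$nodes = $xpath->query('//a[not(ancestor::header) and not(ancestor::footer)]');

// Collect the URL and anchor text of each remaining link.
$links = [];
foreach ($nodes as $a) {
    $links[] = [
        'url'  => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```

With the sample markup this should leave only the domain2.com link. The filtering happens inside the query, so the header and footer links are never iterated in PHP code and the document itself is left unmodified.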
Without mangling the body? You could do something like:
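// Collect the links we want to skip.
$bad_as = $html->find('header a, footer a');
foreach ($html->find('a') as $a) {
    // Strict mode compares object identity instead of comparing node
    // properties recursively (assumes find() returns the same node objects).
    if (in_array($a, $bad_as, true)) continue;
    // do something
}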