I am working on a simple web crawler. Below is the simple HTML I am using to learn.
input.php
<ul id="nav">
<li>
<a href="www.google.com">Google</a>
<ul>
<li>
<a href="mail.gmail.com">Gmail</a>
</li>
</ul>
</li>
<li>
<a href="www.yahoo.com">Yahoo</a>
<ul>
<li>
<a href="mail.yahoo.com">Yahoo Mail</a>
</li>
</ul>
</li>
</ul>
I need to crawl only the first anchor tag in each ul[id=nav]->li. The code I used to crawl input.php is:
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
echo $navUL_LI->find('a',0)->outertext."<br>";
}
}
?>
It displays all the anchor tags in my input.php, including the nested ones. I need to display only Google and Yahoo. How can I achieve this?
In this case you can target them directly with the children() method, which returns only the direct children of an element instead of all of its descendants. Example:
foreach($html->find('ul#nav') as $ul) {
foreach($ul->children() as $li) {
echo $li->children(0)->outertext . '<br/>';
}
}
Alternatively, you can use DOMDocument + DOMXPath for this too:
$dom = new DOMDocument();
// $str is assumed to hold the HTML markup as a string
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');
foreach($links as $a) {
echo $a->nodeValue . '<br/>';
}
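To see why the single-slash steps in that query matter, here is a minimal self-contained sketch. It assumes the markup from the question is held in a string; the names $markup, $topLevel, $all and the helper linkTexts() are illustrative and not part of the answer above. A step written as /li/a selects only direct children, while //a selects every descendant anchor.

```php
<?php
// Markup from the question, inlined so the sketch runs on its own.
$markup = <<<HTML
<ul id="nav">
<li><a href="www.google.com">Google</a>
<ul><li><a href="mail.gmail.com">Gmail</a></li></ul>
</li>
<li><a href="www.yahoo.com">Yahoo</a>
<ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul>
</li>
</ul>
HTML;

// Hypothetical helper: run an XPath query and return the text of each match.
function linkTexts(string $markup, string $query): array {
    $dom = new DOMDocument();
    $dom->loadHTML($markup);
    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Child steps only: anchors that sit directly inside the top-level li elements.
$topLevel = linkTexts($markup, '//ul[@id="nav"]/li/a');
// Descendant step: every anchor anywhere below the list.
$all = linkTexts($markup, '//ul[@id="nav"]//a');

echo implode(', ', $topLevel), "\n"; // Google, Yahoo
echo implode(', ', $all), "\n";      // Google, Gmail, Yahoo, Yahoo Mail
```

The same distinction explains the original problem: find('li') in simple_html_dom behaves like the descendant step, so the nested li elements are visited as well.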
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
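// compare against false, because strpos() can also return 0 for a match
// note: a nested li whose markup contains the word (e.g. mail.yahoo.com) matches too
if(strpos($navUL_LI,'google') !== false || strpos($navUL_LI,'yahoo') !== false){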
echo $navUL_LI->find('a',0)->outertext."<br>";
}
}
}
?>
I have done the same thing in Objective-C.
You can use the XML or HTML APIs to parse your HTML into an object tree.
If you want to do this by hand, find the opening tag and the closing tag.
After that, get the first child, then the second, and so on.
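The "first child, then the next one" walk described above could look roughly like this in PHP. This is only a sketch using PHP's DOM extension rather than Objective-C; the helper firstChildElement() and the variables $markup and $found are hypothetical names chosen for illustration.

```php
<?php
// Markup from the question, inlined so the sketch runs on its own.
$markup = '<ul id="nav">'
    . '<li><a href="www.google.com">Google</a><ul><li><a href="mail.gmail.com">Gmail</a></li></ul></li>'
    . '<li><a href="www.yahoo.com">Yahoo</a><ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul></li>'
    . '</ul>';

// Hypothetical helper: first direct child element with the given tag name, or null.
function firstChildElement(DOMNode $parent, string $name): ?DOMElement {
    for ($child = $parent->firstChild; $child !== null; $child = $child->nextSibling) {
        if ($child instanceof DOMElement && $child->nodeName === $name) {
            return $child;
        }
    }
    return null;
}

$dom = new DOMDocument();
$dom->loadHTML($markup);
$nav = (new DOMXPath($dom))->query('//ul[@id="nav"]')->item(0);

$found = [];
// Visit the first child of the list, then its next sibling, and so on.
for ($li = $nav->firstChild; $li !== null; $li = $li->nextSibling) {
    if (!($li instanceof DOMElement) || $li->nodeName !== 'li') {
        continue; // skip text nodes and any element that is not an li
    }
    $a = firstChildElement($li, 'a');
    if ($a !== null) {
        $found[] = $a->nodeValue;
    }
}

echo implode(', ', $found), "\n"; // Google, Yahoo
```

Because the loop never descends below the direct children of each top-level li, the nested Gmail and Yahoo Mail links are never reached.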
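You can achieve that simply with:
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
// a negative index counts from the end; find() returns null when the li holds fewer than two anchors
$a = $navUL_LI->find('a',-2);
if ($a) {
echo $a->outertext."<br>";
}
}
}
?>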
Try this:
// get the children of the element #nav, i.e. the top level lis
$lis = $html->getElementById("nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
$a = $li->find('a',0);
// retrieve the link text itself.
echo "link text: " . $a->innertext() . "\n";
}
See the simple-html-dom manual for details of all these methods.
<?php
$in = '<style> .catalog-product-view .product.attribute.overview ul { margin-top: 10px; } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';
function parseTags($input, $callback) {
$len = strlen($input);
$stack = [];
$tag = "";
$data = "";
$isTag = false;
$isString = false;
for ($i=0; $i<$len; $i++) {
$char = $input[$i];
if ($char == '<') {
$isTag = true;
$tag .= $char;
} else if ($char == '>') {
$tag .= $char;
if (substr($tag, 0, 2) == '</') {
// a limit of 2 splits the tag name from its attributes (a limit of 1 returns the whole string)
$close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 2)[0]));
$open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 2)[0]));
if ($open == $close) {
$callback($tag, $data, $stack, $i, false);
array_pop($stack);
}
} else if (substr($tag, -2) == '/>') {
$callback($tag, $data, $stack, $i, false);
} else {
$callback($tag, $data, $stack, $i, true);
$stack[] = $tag;
}
$tag = "";
$data = "";
$isTag = false;
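} else if ($char == '"' || $char == "'") {
if ($isString == false) {
$isString = $char;
} else if ($isString == $char && $input[$i-1] != '\\') {
$isString = false;
}
// keep the quote character itself, otherwise it is dropped from the output
if ($isTag) {
$tag .= $char;
} else {
$data .= $char;
}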
} else if ($isTag) {
$tag .= $char;
} else {
$data .= $char;
}
}
}
parseTags($in, function($tag, $data, $stack, $position, $isOpen) {
print_r(func_get_args());
});