PHP Simple HTML DOM Parser

I am working on a simple web crawler. Below is the simple HTML I am using to learn.

input.php

<ul id="nav">
    <li>
        <a href="www.google.com">Google</a>
        <ul>
            <li>
                <a href="mail.gmail.com">Gmail</a>
            </li>
        </ul>
    </li>
    <li>
        <a href="www.yahoo.com">Yahoo</a>
        <ul>
            <li>
                <a href="mail.yahoo.com">Yahoo Mail</a>
            </li>
        </ul>
    </li>
</ul>

I need to crawl only the first anchor tag in each ul[id=nav]->li. The code I use to crawl input.php is:

<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');

    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            echo $navUL_LI->find('a',0)->outertext."<br>";              
        }
    }
?>

It displays every anchor tag in input.php. I need to display only Google and Yahoo. How can I achieve this?

In this case you can target them directly with the children() method, which returns only direct child elements. Example:

foreach($html->find('ul#nav') as $ul) {
    foreach($ul->children() as $li) {
        echo $li->children(0)->outertext . '<br/>';
    }
}
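The original loop prints every link because find('li') matches list items at any depth, while children() stops one level down. Here is a minimal sketch of that difference. It uses PHP's built-in DOM extension rather than simple_html_dom, an assumption made so that it runs without the library; $markup, $allItems and $topItems are illustrative names.

```php
<?php
// Same structure as input.php, inlined so the sketch is self-contained.
$markup = '<ul id="nav">'
    . '<li><a href="www.google.com">Google</a>'
    . '<ul><li><a href="mail.gmail.com">Gmail</a></li></ul></li>'
    . '<li><a href="www.yahoo.com">Yahoo</a>'
    . '<ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul></li>'
    . '</ul>';

$dom = new DOMDocument();
$dom->loadHTML($markup);

// The outer list is the first <ul> in document order.
$nav = $dom->getElementsByTagName('ul')->item(0);

// Descendant search: also finds the nested <li> elements.
$allItems = $nav->getElementsByTagName('li')->length;

// Direct children only: count <li> element nodes one level down.
$topItems = 0;
foreach ($nav->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === 'li') {
        $topItems++;
    }
}

echo "all li: $allItems, top-level li: $topItems\n";
```

Only the top-level items carry the Google and Yahoo links, which is why restricting the loop to direct children is enough.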

Alternatively, you can use DOMDocument + DOMXPath for this too:

// $str holds the HTML markup, here read from input.php
$str = file_get_contents('input.php');
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');

foreach($links as $a) {
    echo $a->nodeValue . '<br/>';
}
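For a version that runs on its own, here is a minimal sketch with the markup from the question inlined. The $texts array and the heredoc are illustrative additions. Each "/" step in the XPath expression selects direct children only, so the nested mail links should be skipped.

```php
<?php
// Markup from the question, inlined so that $str is defined.
$str = <<<MARKUP
<ul id="nav">
    <li>
        <a href="www.google.com">Google</a>
        <ul><li><a href="mail.gmail.com">Gmail</a></li></ul>
    </li>
    <li>
        <a href="www.yahoo.com">Yahoo</a>
        <ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul>
    </li>
</ul>
MARKUP;

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

// ul -> li -> a, one level at a time: anchors inside nested lists do not match.
$texts = [];
foreach ($xpath->query('//ul[@id="nav"]/li/a') as $a) {
    $texts[] = trim($a->nodeValue);
}

echo implode(', ', $texts), "\n";
```

Use $a->getAttribute('href') instead of nodeValue if you need the link target rather than the link text.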

<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');

    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
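            // strpos() returns false when there is no match, so compare against false explicitly.
            // Note: a nested item such as "Yahoo Mail" also contains 'yahoo' and will match too.
            if (strpos($navUL_LI->outertext, 'google') !== false || strpos($navUL_LI->outertext, 'yahoo') !== false) {
                echo $navUL_LI->find('a',0)->outertext."<br>";
            }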

        }
    }
?>

I have done the same thing in Objective-C.

You can use an XML or HTML API to parse your HTML into an object.

If you want to do this by hand, find the opening tag and the closing tag.

After that, get the first child, then the second, and so on.
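The "first child, then the second, and so on" walk can be sketched in PHP with the built-in DOM extension. This is one possible reading of the advice above, not the answerer's Objective-C code. firstElement() is a hypothetical helper introduced for illustration, and the markup is the one from the question.

```php
<?php
// Hypothetical helper: first direct child of $node that is an element named $name.
function firstElement(DOMNode $node, string $name): ?DOMElement
{
    for ($child = $node->firstChild; $child !== null; $child = $child->nextSibling) {
        if ($child instanceof DOMElement && $child->nodeName === $name) {
            return $child;
        }
    }
    return null;
}

$dom = new DOMDocument();
$dom->loadHTML(
    '<ul id="nav">'
    . '<li><a href="www.google.com">Google</a>'
    . '<ul><li><a href="mail.gmail.com">Gmail</a></li></ul></li>'
    . '<li><a href="www.yahoo.com">Yahoo</a>'
    . '<ul><li><a href="mail.yahoo.com">Yahoo Mail</a></li></ul></li>'
    . '</ul>'
);

$nav = $dom->getElementsByTagName('ul')->item(0);

// Visit the first child, then each next sibling, staying one level deep.
$hrefs = [];
for ($li = $nav->firstChild; $li !== null; $li = $li->nextSibling) {
    $a = firstElement($li, 'a');
    if ($a !== null) {
        $hrefs[] = $a->getAttribute('href');
    }
}

print_r($hrefs);
```

Because the walk never descends into the nested lists, only the anchors that sit directly under the top-level items are collected.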

You can simply achieve that with:

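<?php
    include 'simple_html_dom.php';
    $html = file_get_html('input.php');
    foreach ($html->find('ul[id=nav]') as $navUL){
        foreach ($navUL->find('li') as $navUL_LI){
            // A negative index counts from the end: -2 is the first of the two anchors
            // in a top-level item. Nested items hold one anchor, so find() returns null.
            $a = $navUL_LI->find('a',-2);
            if ($a !== null) {
                echo $a->outertext."<br>";
            }
        }
    }
?>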

Try this:

// get the children of the element #nav, i.e. the top-level lis
// (getElementById() takes the bare id, without the leading "#")
$lis = $html->getElementById("nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
    $a = $li->find('a',0);
    // retrieve the link text itself.
    echo "link text: " . $a->innertext() . "\n";
}

See the simple-html-dom manual for details of all these methods.

<?php
$in = '<style>      .catalog-product-view .product.attribute.overview ul {         margin-top: 10px;     } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';

function parseTags($input, $callback) {
    $len = strlen($input);
    $stack = [];

    $tag = "";
    $data = "";
    $isTag = false;
    $isString = false;
    for ($i=0; $i<$len; $i++) {
       $char = $input[$i];
       if ($char == '<') {
           $isTag = true;
           $tag .= $char;
       } else if ($char == '>') {
           $tag .= $char;
           if (substr($tag, 0, 2) == '</') {
               // Limit 2 splits the tag name from its attributes (a limit of 1 returns the whole string)
               $close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 2)[0]));
               $open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 2)[0]));
               if ($open == $close) {
                   $callback($tag, $data, $stack, $i, false);
                   array_pop($stack);
               }
           } else if (substr($tag, -2) == '/>') {
               $callback($tag, $data, $stack, $i, false);
           } else {
               $callback($tag, $data, $stack, $i, true);
               $stack[] = $tag;
           }
           $tag = "";
           $data = "";
           $isTag = false;
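       } else if ($char == '"' || $char == "'") {
           // Track whether we are inside a quoted string
           if ($isString === false) {
               $isString = $char;
           } else if ($isString === $char && $input[$i-1] != '\\') {
               $isString = false;
           }
           // Keep the quote character itself instead of dropping it
           if ($isTag) {
               $tag .= $char;
           } else {
               $data .= $char;
           }
       } else if ($isTag) {
           $tag .= $char;
       } else {
           $data .= $char;
       }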
    }
}

parseTags($in, function($tag, $data, $stack, $position, $isOpen) {
    // Dump each tag event: the tag, the text before it, the open-tag stack, the offset, and the open flag
    print_r(func_get_args());
});