PHP scraping with Simple HTML DOM

include('simple_html_dom.php');

function curl_set($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = curl_exec($ch);
    return $result;
}

$curl_scraped_page = curl_set('http://www.belmontwine.com/site-map.html');
$html = new simple_html_dom();
$html->load($curl_scraped_page, true, false);

$i = 0;
$ab = array();
$files = array();
foreach ($html->find('td[class=site-map]') as $td) {
    foreach ($td->find('li a') as $a) {
        if ($i <= 2) {
            $ab = 'http://www.belmontwine.com' . $a->href;
            $html = file_get_html($ab);
            foreach ($html->find('td[class=pageheader]') as $file) {
                $files[] = $file->innertext;
            }
        } else {
            //exit();
        }
        $i++;
    }
    $html->clear();
}

print_r($files);

Above is my code. I need help scraping a site with PHP.

The $ab variable holds the URLs scraped from the site map, and I want to scrape data from each of those URLs. I don't know what's wrong with the script. The desired output should come from the URLs held in $ab, but the script returns nothing and just seems to loop continuously.

Any help would be appreciated.

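Your loop control looks off. The $i++ sits after the if/else, so $i counts every link in the inner loop rather than only the ones you actually fetch, and it is never reset to 0 for the next td. I don't know why you want to limit the finds to three or fewer, but if that limit is meant to apply per td you need to reset $i at the top of the outer loop. Note also that inside the if block you assign the result of file_get_html() to $html, which replaces the object the outer loop was started from, and the $html->clear() at the end then clears that replacement. Use a separate variable for the pages you fetch.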
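One way to sidestep the counter bookkeeping is to cap the list of links before looping and to keep each fetched page in its own variable. The following is only a minimal sketch of that control flow: `scrape_headers`, `$fake_fetch`, and the sample paths are hypothetical names and data introduced for illustration, and the fetcher is a stub so that the snippet runs without network access or simple_html_dom.

```php
<?php
// Hypothetical helper: visit at most $limit URLs and collect what $fetch returns.
// $fetch stands in for "download the page and extract its header cells".
function scrape_headers(array $urls, callable $fetch, int $limit = 3): array
{
    $files = array();
    // array_slice caps the work up front, so no $i counter is needed.
    foreach (array_slice($urls, 0, $limit) as $url) {
        // $headers is a separate variable: nothing here overwrites
        // whatever object produced $urls.
        $headers = $fetch($url);
        foreach ($headers as $header) {
            $files[] = $header;
        }
    }
    return $files;
}

// Stub fetcher with made-up data so the sketch runs offline.
$fake_fetch = function (string $url): array {
    return array('Header for ' . $url);
};

// Made-up paths; only the first three are visited.
$urls = array('/a.html', '/b.html', '/c.html', '/d.html', '/e.html');
print_r(scrape_headers($urls, $fake_fetch));
```

In the real script the fetcher would presumably wrap the file_get_html() call and the find('td[class=pageheader]') loop from the question, returning the innertext values as an array.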

EDIT:

I don't use the simple_html_dom class, so I don't know it very well. I also don't know what you want to do with each link found, and I can't do the work for you. I came up with this sample PHP script that grabs all the links from your site-map page. It builds an array that maps each link title to its href path. The last foreach loop just prints the array for now, but you could use that loop to process each path found.

include('simple_html_dom.php');

$files = array();
$html = file_get_html('http://www.belmontwine.com/site-map.html');
foreach($html->find('td[class=site-map]') as $td)
{
 foreach($td->find('li a') as $a)
 {
  if($a->plaintext != '')
  {
   $files["$a->plaintext"] = "http://www.belmontwine.com/$a->href";
  }
 }
}
// To print $files array or to process each link found
foreach($files as $title => $path)
{
 echo('Title: ' . $title . ' - Path: ' . $path . '<br>' . PHP_EOL);
}
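The question's code joins the host and the href without a slash, while the script above joins them with one, so depending on whether the hrefs on the page start with "/" one of the two may produce a malformed URL. I have not checked which form the page uses. A small helper, here given the hypothetical name `join_url`, could normalize both cases; this is a sketch that only handles simple paths and does not resolve things like "../".

```php
<?php
// Hypothetical helper: build an absolute URL from a base and an href.
function join_url(string $base, string $href): string
{
    // Leave hrefs that are already absolute untouched.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Exactly one slash between the two parts, whatever the inputs had.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

echo join_url('http://www.belmontwine.com', '/site-map.html'), PHP_EOL;
echo join_url('http://www.belmontwine.com/', 'site-map.html'), PHP_EOL;
```

Both calls print the same URL, so the loop could call join_url('http://www.belmontwine.com', $a->href) regardless of how the href is written.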

Also, not every link found is an HTML file; at least one is a PDF, so be sure to test for that in your code.
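A rough way to do that test is to look at the file extension in the URL path before fetching. The sketch below uses a hypothetical function name, `looks_like_html`, an assumed list of page extensions, and made-up sample paths; it only inspects the URL, so it cannot catch a PDF served from an extensionless path.

```php
<?php
// Hypothetical helper: guess from the URL path whether a link is an HTML page.
function looks_like_html(string $url): bool
{
    $path = parse_url($url, PHP_URL_PATH);
    // No path component (bare host) or unparsable URL: assume it is a page.
    if (!is_string($path) || $path === '') {
        return true;
    }
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    // Assumed list of extensions that count as pages; adjust as needed.
    return in_array($ext, array('', 'html', 'htm', 'php'), true);
}

// Made-up sample data in the same title => path shape as $files above.
$files = array(
    'Site Map'  => 'http://www.belmontwine.com/site-map.html',
    'Wine List' => 'http://www.belmontwine.com/docs/list.pdf',
);

// array_filter keeps the titles (keys) of the entries that pass the test.
$pages = array_filter($files, 'looks_like_html');
print_r(array_keys($pages));
```

Checking the Content-Type header of the response would be more reliable than the extension, at the cost of an extra request per link.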