Using cURL to get information from an HTML table

I need to get some information about some plants and put it into a MySQL table. My knowledge of cURL and the DOM is close to nil, but I've come up with this:

set_time_limit(0);
include('simple_html_dom.php');

$ch = curl_init("http://davesgarden.com/guides/pf/go/1501/");

curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 Firefox/3.0.1");
curl_setopt($ch, CURLOPT_HTTPHEADER, array("Accept-Language: es-es,en"));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
curl_setopt($ch, CURLOPT_TIMEOUT, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$data = curl_exec($ch);
curl_close($ch);

$html = str_get_html($data);

$e = $html->find("table", 8);

echo $e->innertext;

Now I'm really lost about how to move on from this point. Can you please guide me?

Thanks!

This is a mess.

But at least it's a (somewhat) consistent mess.

If this is a one-time extraction and not a rolling project, personally I'd use a quick and dirty regex on this instead of simple_html_dom. Otherwise you'll be there all day twiddling with the tags.

For example, this regex pulls out the majority of title/data pairs:

$pattern = '~<b>(.*?)</b>\s*<br>(.*?)</?(td|p)>~si';

You'll need to do some pre- and post-cleaning before it will get them all, though.
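As a rough illustration of the regex approach with some post-cleaning, here is a minimal sketch. The sample HTML, its values, and the helper names `extract_pairs` and `clean` are made up for the example; the real page's markup will likely need extra handling. It uses `~` as the pattern delimiter so the slash in `</b>` needs no escaping.

```php
<?php
// Hypothetical sample mimicking a "<b>Title:</b><br>value" layout.
// The values are invented for the example; the real page may differ.
$html = '<table><tr><td><b>Family:</b><br><a href="#">Lamiaceae</a></td>'
      . '<td><b>Genus:</b> <br>Salvia&nbsp;</td></tr></table>';

// Returns an array of title => value pairs found in $html.
function extract_pairs(string $html): array
{
    $pattern = '~<b>(.*?)</b>\s*<br>(.*?)</?(?:td|p)>~si';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $pairs = [];
    foreach ($matches as $m) {
        // Strip the trailing colon from the title and clean both parts.
        $pairs[rtrim(clean($m[1]), ':')] = clean($m[2]);
    }
    return $pairs;
}

// Post-cleaning: drop tags, decode entities, collapse whitespace.
function clean(string $s): string
{
    $s = html_entity_decode(strip_tags($s), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $s = str_replace("\u{00A0}", ' ', $s); // non-breaking space -> plain space
    return trim(preg_replace('~\s+~u', ' ', $s));
}

print_r(extract_pairs($html));
```

Keying the result by title makes it easy to pick out only the fields you care about later.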

I don't envy you having this task...

Your best bet will be to wrap this in PHP ;)

Yes, this is an ugly hack for ugly HTML.

<?php
// Dump the page as plain text with links, then let perl pick out the
// "Family/Genus/Species: name (word)" lines.
$cmd = <<<'SH'
/usr/bin/env links -dump 'http://davesgarden.com/guides/pf/go/1501/' |
/usr/bin/env perl -lne 'm/((Family|Genus|Species):\s+\w+\s+\([\w-]+\))/ && print $1'
SH;
$out = shell_exec($cmd);
print $out;
?>

Use Simple HTML DOM and you will be able to access any element, or any element's content, you wish. Its API is very straightforward.

You can try something like this.

<?php 
$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

$data = array();
// get all table rows that are not header rows
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach($table_rows as $row => $tr) {
    foreach($tr->childNodes as $td) {
        $data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
    }
    $data[$row] = array_values(array_filter($data[$row]));
}

echo '<pre>';
print_r($data);
?>
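
The question also asks about getting the data into a MySQL table. Whichever extraction approach you use, once you have label/value pairs, a PDO prepared statement is one way to store them. This is only a sketch: the function name `store_pairs`, the table `plant_facts`, and its columns `plant_id`, `label` and `content` are hypothetical, and the DSN and credentials in the usage comment are placeholders.

```php
<?php
/**
 * Insert label/value pairs for one plant inside a single transaction.
 * Table and column names are hypothetical; adapt them to your schema.
 * Returns the number of rows inserted.
 */
function store_pairs(PDO $pdo, int $plantId, array $pairs): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO plant_facts (plant_id, label, content) VALUES (?, ?, ?)'
    );

    $pdo->beginTransaction();
    try {
        $count = 0;
        foreach ($pairs as $label => $content) {
            // Bound parameters, so scraped text is never spliced into the SQL.
            $stmt->execute([$plantId, (string) $label, (string) $content]);
            $count++;
        }
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack();
        throw $e;
    }
    return $count;
}

// Usage sketch (placeholder DSN, credentials and values):
// $pdo = new PDO('mysql:host=localhost;dbname=garden;charset=utf8mb4', 'user', 'secret',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// store_pairs($pdo, 1501, ['Family' => '...', 'Genus' => '...']);
```

Wrapping the inserts in a transaction means a failure halfway through a plant leaves no partial rows behind.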