I am using PHP to try and scrape a page that seems to dynamically load content just milliseconds after the parent page finishes loading.
I am using cURL to fetch the page, and simple_html_dom to pull things out of the returned HTML.
My efforts to traverse the DOM and explode() things out of the HTML return nothing. My only idea is that the content is being loaded after the parent page has finished loading.
Here is my code.
<?
$url = 'http://www.facebook.com/OneAndroidAppaDay';
$scrapeUrl = 'http://www.facebook.com/OneAndroidAppaDay';
include_once('simple_html_dom.php');
require_once("bitly.php");
$userAgent = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
curl_setopt($ch, CURLOPT_URL,$scrapeUrl);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html) {
echo "<br />cURL error number:" .curl_errno($ch);
echo "<br />cURL error:" . curl_error($ch);
exit;
}
$appBitlyUrl = $html->find('div[class=UIStoryAttachment_Title]',0)->find('a',0)->href; // fail :(
echo 'Bitly Url: ' . $appBitlyUrl;
?>
It's bombing out at line 24 (denoted with the inline comment) with this error:
Fatal error: Call to a member function find() on a non-object in /home/xxxxxxxx/public_html/xxx.xx/xxxx.php on line 24
Is there a way to make it wait a second or two before it snatches the page's html? Or maybe someone has some better insight?
Thanks
Mark
To add a simple delay:
sleep(2); // 2 second delay before continuing
You should really re-read the error message. It doesn't stem from a timing issue.
curl_exec() gives you $html as a plain string, and you cannot call simple_html_dom methods such as ->find() on a string. That is exactly what "Call to a member function find() on a non-object" means. You have to parse the string into a DOM object before traversing it. It's also unclear why you are using curl in the first place. Either just use $dom = str_get_html($html)
or try:
$dom = file_get_html('http://www.facebook.com/OneAndroidAppaDay');
$bituurl = $dom->find('div[class=UIStoryAttachment_Title]',0)->...
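If you want to see the "parse first, then traverse, and check for a missing node" idea in isolation, here is a minimal sketch. Note that it swaps in PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of simple_html_dom, purely so it runs without the extra include. The helper name firstAttachmentLink() and the inline markup are made up for illustration; only the class name comes from the question.

```php
<?php
// Hypothetical helper: returns the href of the first link found inside a
// div carrying the given class, or null when nothing matches.
function firstAttachmentLink(string $html, string $class = 'UIStoryAttachment_Title'): ?string
{
    // An empty string cannot be parsed (curl failed or returned nothing).
    if ($html === '') {
        return null;
    }

    $dom = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word within the class attribute.
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]//a[@href]',
        $class
    );
    $links = $xpath->query($query);
    // Guard against "no match" instead of chaining calls on a missing node.
    if ($links === false || $links->length === 0) {
        return null;
    }
    return $links->item(0)->getAttribute('href');
}

// Made-up markup standing in for the string returned by curl_exec().
$html = '<html><body><div class="UIStoryAttachment_Title">'
      . '<a href="http://bit.ly/example">App</a></div></body></html>';

$appBitlyUrl = firstAttachmentLink($html);
echo 'Bitly Url: ' . ($appBitlyUrl ?? 'not found') . "\n";
```

The same guard applies with simple_html_dom: check the result of the first ->find(..., 0) before chaining a second ->find() on it, because the element may simply not be in the HTML that the server sent back.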