I'm writing a small app that starts from a URL and collects all the links on that page. Next, it visits each link and scrapes its contents, but keeps only a specific kind of content (numbers with 10 or more characters). This is my code, but it returns a blank page. What is wrong?
$url = 'http://xxx.xxx';
$original_file = file_get_contents($url);
$stripped_file = strip_tags($original_file, "<a>");
preg_match_all("/<a(?:[^>]*)href=\"([^\"]*)\"(?:[^>]*)>(?:[^<]*)<\/a>/is", $stripped_file, $matches);
$links = $matches[1];
//print_r($links);
$count = count($links);
for ($i = 0; $i < $count; $i++)
{
$curl_handle=curl_init();
curl_setopt($curl_handle, CURLOPT_URL,$links[$i]);
curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1');
$query = curl_exec($curl_handle);
curl_close($curl_handle);
preg_match_all('/\b3\d+/', $query, $matches2); // note: this only matches numbers that start with 3
$numbers = $matches2[0];
$found = 0; // don't reuse $count here: it's the loop bound above, so overwriting it breaks the for loop
foreach($numbers as $value) {
if(strlen((string)$value) >= 10) echo '<br><br>[' . $found++ . "]" . $value;
}
}
Issue#1: Your HTML can contain relative links such as

<a href="/home/test.php">link</a>

from which your regex extracts /home/test.php without the base http://www.example.com/. So before requesting each link with cURL, print it to the screen or browser and check what it actually is.
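To turn such a link into something cURL can fetch, you can prepend the base before the request. Here is a minimal sketch; resolveLink() is a hypothetical helper (not part of your code), and it only handles the common absolute, root-relative, and path-relative cases:

```php
<?php
// Resolve a scraped href against the URL of the page it came from.
function resolveLink(string $base, string $href): string {
    // Already absolute? Leave it alone.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    // Root-relative link (/home/test.php): prepend scheme + host only.
    if ($href !== '' && $href[0] === '/') {
        return $origin . $href;
    }
    // Path-relative link (test.php): prepend the directory of the base path.
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $origin . $dir . '/' . $href;
}

echo resolveLink('http://www.example.com/home/index.php', '/home/test.php');
// http://www.example.com/home/test.php
```

In your loop you would then call curl_setopt($curl_handle, CURLOPT_URL, resolveLink($url, $links[$i])) instead of passing $links[$i] directly.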
Issue#2: 2 seconds for CURLOPT_CONNECTTIMEOUT may be too short for you. Try increasing this value:

curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 10);
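More generally, a blank page often means curl_exec() returned false and the error was silently discarded. A sketch of a fetch helper that surfaces the error instead (the timeout values here are assumptions, not requirements):

```php
<?php
// Fetch a URL with cURL, returning the body or null on failure,
// and logging the cURL error so a blank result is diagnosable.
function fetchPage(string $url): ?string {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    $body = curl_exec($ch);
    if ($body === false) {
        error_log('cURL error for ' . $url . ': ' . curl_error($ch));
        $body = null;
    }
    curl_close($ch);
    return $body;
}
```

With this in place you can check which of your scraped links actually fail rather than getting silent empty output.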
If the problem still persists, please show us a link to a sample page, and a sample internal link for which you get the blank response.