I want to crawl PDF links, but some of the links I get are duplicates. How can I remove the duplicate links? Thank you :)
<?php
include 'simple_html_dom.php';
$url = 'http://scholar.google.com/scholar?hl=en&q=data+mining&btnG=&as_sdt=1%2C5&as_sdtp=';
$html = file_get_html($url) or die ('invalid url');
foreach($html->find('a') as $e) {
$link= $e->href;
if (preg_match('/\.pdf$/i', $link)) {
print_r($link);
}
}
?>
Put the links in an array and then use array_unique():
$links = array(); // initialize so array_unique() always receives an array
foreach($html->find('a') as $e) {
$link= $e->href;
if (preg_match('/\.pdf$/i', $link)) {
$links[] = $link;
}
}
$links = array_unique( $links );
$url = 'http://scholar.google.com/scholar?hl=en&q=data+mining&btnG=&as_sdt=1%2C5&as_sdtp=';
$html = file_get_html($url) or die ('invalid url');
$arr = array();
foreach($html->find('a') as $e) {
$link= $e->href;
if(strtolower(substr($link, strrpos($link, '.'))) === '.pdf')
$arr[] = $link;
}
$arr = array_unique($arr); // array_unique() returns a new array; it does not modify $arr in place
print_r($arr);
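To try the de-duplication step without fetching a live page, here is a minimal sketch. The `$hrefs` array and its example.com URLs are made up for illustration and stand in for the `$e->href` values from the parser. It also shows that array_unique() keeps the original keys, so array_values() is used to reindex the result.

```php
<?php
// Hypothetical sample hrefs standing in for the scraped $e->href values.
$hrefs = array(
    'http://example.com/a.pdf',
    'http://example.com/b.html',
    'http://example.com/a.pdf',
    'http://example.com/c.PDF',
);

// Keep only the links that end in .pdf (case-insensitive).
$links = array();
foreach ($hrefs as $link) {
    if (preg_match('/\.pdf$/i', $link)) {
        $links[] = $link;
    }
}

// array_unique() returns a new array and preserves the original keys,
// so assign the result and reindex it with array_values().
$links = array_values(array_unique($links));

print_r($links);
```

Without array_values() the remaining keys here would be 0 and 2, which only matters if you later loop over the array by numeric index.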