I'm trying to extract the first occurrence of a link that starts like this:
https://encrypted-tbn3.gstatic.com/images?...
from the source code of a page. The link is enclosed in double quotes. Here is what I've got so far:
$search_query = $array[0]['Name'];
$search_query = urlencode($search_query);
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents( "https://www.google.com/search?q=$search_query&tbm=isch", false, $context);
$html = str_get_html($response);
$url = explode('"',strstr($html, 'https://encrypted-tbn3.gstatic.com/images?'[0]))
However, the output of $url is not the link I'm trying to extract, but something very different. I have added an image.
Could anyone explain the output to me and tell me how I would get the desired link? Thanks
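As for the output itself: in the last line the [0] sits inside the strstr() call, directly after the string literal, so PHP reads it as "character 0 of that string", which is 'h'. strstr($html, 'h') then returns everything from the first "h" in the page onward, and explode() splits that on every double quote, so $url ends up as an array of unrelated fragments. A minimal sketch of the difference, assuming a made-up $html sample string (not real Google output):

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:abc" alt="x"></html>';
$needle = 'https://encrypted-tbn3.gstatic.com/images?';

// What the original line does: the [0] indexes the string, yielding 'h'.
$first_char = $needle[0];
$wrong = explode('"', strstr($html, $first_char));
// $wrong[0] begins at the first "h" in the page, here: html><img src=

// Intended version: cut from the needle onward, then take element 0 of the exploded array.
$tail = strstr($html, $needle);
$url = ($tail !== false) ? explode('"', $tail)[0] : '';
```

With the index moved outside strstr(), $url holds the text from the start of the link up to the closing double quote.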
It seems that you're using PHP Simple HTML DOM Parser.
I normally use DOMDocument, which is one of PHP's built-in classes.
Here's a working example of what you need:
$search_query = $array[0]['Name'];
$search_query = urlencode($search_query);
$context = stream_context_create(array('http' => array('header' => 'User-Agent: Mozilla compatible')));
$response = file_get_contents( "https://www.google.com/search?q=$search_query&tbm=isch", false, $context);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($response);
foreach ($dom->getElementsByTagName('img') as $item) {
    $img_src = $item->getAttribute('src');
    // Keep only the thumbnail URLs served from the encrypted-tbn hosts
    if (strpos($img_src, 'https://encrypted') !== false) {
        print $img_src . "\n";
    }
}
Output:
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSumjp6e37O_86nc36mlktuWpbFuCI4nkkkocoBCYW3qCOicqdu_KEK-MY
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcR_ttK8NlBgui_JndBj349UxZx0kHn0Z-Essswci-_5UQCmUOruY1PNl3M
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSydaTpSDw2mvU2JRBGEYUOstTUl4R1VhRevv1Sdinf0fxRvU26l3pTuqo
...
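Since the question asks only for the first occurrence, the same loop can return as soon as it finds a match. The following is a sketch rather than a definitive implementation: first_image_src() is a hypothetical helper name, the sample markup is made up, and the nullable return type assumes PHP 7.1 or later. Note that the host number varies (tbn2, tbn3) in the output above, so matching on the shorter prefix may be more robust.

```php
<?php
// Hypothetical helper: returns the first <img> src that starts with $prefix, or null.
function first_image_src(string $html, string $prefix): ?string
{
    libxml_use_internal_errors(true);   // keep warnings about sloppy HTML out of the output
    $dom = new DOMDocument();
    if ($html === '' || !$dom->loadHTML($html)) {
        return null;
    }
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strpos($src, $prefix) === 0) {   // match only at the start of the attribute
            return $src;                      // stop at the first hit
        }
    }
    return null;
}

// Made-up sample markup, not real search output.
$sample = '<html><body><img src="/logo.png">'
        . '<img src="https://encrypted-tbn2.gstatic.com/images?q=tbn:aaa">'
        . '<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:bbb">'
        . '</body></html>';

$first = first_image_src($sample, 'https://encrypted-tbn');
```

In the original script you would pass $response instead of $sample.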
$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';
if(preg_match('/\"(https\:\/\/encrypted\-tbn3\.gstatic\.com\/images\?.+?)\"/ui',$html, $matches))
$url = $matches[1];
else
$url = '';
Try using preg_match; it is better suited to pulling a pattern out of text.
In this example I assumed that the URL in your HTML is enclosed in double quotes.
Update: a slightly tuned version that works for any URL prefix:
$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';
$url_beginning = preg_replace('/([^а-яА-Яa-zA-Z0-9_@%\s])/ui', '\\\\$1', $url_beginning);
if(preg_match('/\"('.$url_beginning.'.+?)\"/ui',$html, $matches))
$url = $matches[1];
else
$url = '';
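As an alternative to escaping the prefix by hand, PHP's built-in preg_quote() escapes regex metacharacters, and its second argument also escapes the chosen delimiter. A minimal sketch, assuming a made-up $html sample string in place of the fetched page:

```php
<?php
// Made-up sample markup standing in for the fetched page.
$html = '<img src="https://encrypted-tbn3.gstatic.com/images?q=tbn:abc" alt="x">';
$url_beginning = 'https://encrypted-tbn3.gstatic.com/images?';

// Escape regex metacharacters in the prefix, including the "/" delimiter.
$quoted = preg_quote($url_beginning, '/');

// [^"]* stops at the closing double quote, so no lazy quantifier is needed.
if (preg_match('/"(' . $quoted . '[^"]*)"/i', $html, $matches)) {
    $url = $matches[1];
} else {
    $url = '';
}
```

preg_match() stops at the first match, which fits the "first occurrence" requirement.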