I want to run a search on Google using PHP or Node.js... I haven't decided which yet; it depends on which answer to this question is easier to implement (the rest of what I want to do is easy in both languages).
After making the query I want to process the result: get the links and the number of results (even just the number of results would be great)...
The search is for an image URL.
Any suggestions?
Google has implemented lots of safeguards to ensure that its search engine can't be scraped. However, Google still has to work for real users in real browsers; that's the whole point. So the best way I've found so far to scrape Google is to control a real web browser.
There's Selenium if you want to go that route. However, I prefer my programs to be self-contained rather than depending on an installed web browser (I run most of my programs on headless servers), so I prefer PhantomJS, a full WebKit-based browser (the same engine as Safari and Konqueror) driven by JavaScript.
PhantomJS scripts tend to be verbose, however, so most people use it through a wrapper such as casperjs, node-horseman or nightmarejs (there are lots more; search npm).
Here's an example of Google scraping from the node-horseman web page:
var Horseman = require('node-horseman');
var horseman = new Horseman();

// Open Google, type the query, submit the form, wait for the results
// page to load, then count the result entries (li.g elements).
var numLinks = horseman
  .open('http://www.google.com')
  .type('input[name="q"]', 'github')
  .click("button:contains('Google Search')")
  .waitForNextPage()
  .count("li.g");

console.log("Number of links: " + numLinks);

horseman.close();
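Since you also want the number of results, the same chained style can read Google's "About N results" line off the page. This is only a sketch: it assumes the older, synchronous node-horseman API used in the example above (newer releases are promise-based), and '#resultStats' is just a guess at the selector Google uses for that line, so verify it in the developer tools.

var Horseman = require('node-horseman');
var horseman = new Horseman();

// Sketch: read the "About 1,234,000 results" text after a search.
// '#resultStats' is an assumed selector and may change at any time.
var resultStats = horseman
  .open('http://www.google.com')
  .type('input[name="q"]', 'github')
  .click("button:contains('Google Search')")
  .waitForNextPage()
  .text('#resultStats');

console.log(resultStats);

horseman.close();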
If you know how to inspect a page with the browser's developer tools, you'll know how to write a scraper using PhantomJS: find the selectors for the elements you want and pull them out of the page.
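For example, here's a sketch of pulling the result links with plain PhantomJS (run it as phantomjs script.js). The 'h3.r a' selector is only an assumption about Google's result markup, so inspect the page and adjust it:

var page = require('webpage').create();

page.open('https://www.google.com/search?q=github', function (status) {
  if (status !== 'success') {
    console.log('Failed to load the page');
    phantom.exit(1);
    return;
  }

  // evaluate() runs inside the page, so use plain DOM calls and
  // return only serialisable data (an array of strings here).
  var links = page.evaluate(function () {
    var anchors = document.querySelectorAll('h3.r a');
    return Array.prototype.map.call(anchors, function (a) { return a.href; });
  });

  console.log(links.join('\n'));
  phantom.exit();
});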
One word of warning: don't hit Google search too frequently, otherwise Google will probably detect your script as a bot and temporarily ban you. Make sure you wait an appropriate amount of time between searches.
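As a rough sketch of what that could look like in code (the 30-90 second range is an arbitrary guess on my part, not anything Google documents), randomize the pause between queries:

// Run a list of searches with a random 30-90 second pause between them.
// runSearch() is a placeholder for whatever scraping code you use above;
// the delay range is an assumption, not an official limit.
var queries = ['github', 'bitbucket', 'gitlab'];

function randomDelay(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

function runNext(i) {
  if (i >= queries.length) { return; }
  runSearch(queries[i]); // placeholder: your Horseman/PhantomJS code goes here
  setTimeout(function () { runNext(i + 1); }, randomDelay(30000, 90000));
}

runNext(0);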
You need to use proxies to avoid being banned. Private proxies work best: the more you have, the faster you can scrape. With 10-50 proxies, keep a delay or a low thread count; if you can afford 100+ then you can really fly.
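If you go the PhantomJS/node-horseman route, the proxy is normally set when the browser is created. A minimal sketch, assuming node-horseman passes proxy options through to PhantomJS's --proxy, --proxy-type and --proxy-auth flags (check its README for the exact option names); the address and credentials below are placeholders:

var Horseman = require('node-horseman');

// Point PhantomJS at one of your private proxies (placeholder values).
var horseman = new Horseman({
  proxy: '203.0.113.10:8080',
  proxyType: 'http',
  proxyAuth: 'user:password'
});

// ...run your searches here, switching to a different proxy between batches...

horseman.close();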