I have a spider class which, on a user request, spiders websites for content. Each search results in loading about 30 websites, spidering them for the relevant information and then standardizing that information.
I have written this in PHP using cURL. Since PHP lacks multitasking, I would like to switch to Java (I am aware of multi-process cURL, which does not suit my needs). I need an HTTP client which can POST/GET, receive and set cookies, as well as modify HTTP headers.
I have found HtmlUnit, which seems nifty but also exceeds my needs; since the package is relatively big and I will have many hundred requests a minute, I don't want an overkill solution slowing down my servers.
Do you think this would be an issue, and do you have other suggestions for replacing cURL in Java? Should I use the Java cURL binding? This is a question of efficiency and server load.
Perhaps take a look at Apache HttpClient?
You can create an HttpClient per thread and use that to do your requests:
while (running) {
    HttpClient client = new DefaultHttpClient();
    // the URI must be absolute, including the scheme
    HttpGet get = new HttpGet("http://mydomain.com/path.html");
    HttpResponse response = client.execute(get);
    // do stuff with response
}
Even better, if you re-use the HttpClient between requests, it will remember the cookies sent back on previous responses and automatically apply them to your next request. In that sense a single HttpClient models an HTTP conversation.
So if you did
HttpGet get1 = new HttpGet("http://mydomain.com/first.html");
client.execute(get1);
// cookies received in the response are stored in the client
HttpGet get2 = new HttpGet("http://mydomain.com/second.html");
client.execute(get2);
// the second request automatically sends back the cookies received in the first response
You could then take a look at Java's ExecutorService, which will make it easy to queue spider jobs and have multiple threads running them.
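As a rough sketch of that combination (the URL list, pool size and SpiderPool class name here are placeholders, not anything from your setup):

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class SpiderPool {
    public static void main(String[] args) throws Exception {
        // the ~30 sites one search has to visit (placeholder URLs)
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");

        // one pool for the whole search; the work is I/O bound, so more threads than cores is fine
        ExecutorService pool = Executors.newFixedThreadPool(10);

        for (final String url : urls) {
            pool.submit(new Runnable() {
                public void run() {
                    // one HttpClient per task, so each task keeps its own cookie state
                    HttpClient client = new DefaultHttpClient();
                    try {
                        HttpResponse response = client.execute(new HttpGet(url));
                        String body = EntityUtils.toString(response.getEntity());
                        // spider/standardize the content here
                        System.out.println(url + " -> " + body.length() + " chars");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }

        pool.shutdown(); // accept no new jobs; running ones finish
    }
}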
Ultimately you will need to evaluate potential solutions to see which best suits your needs.
HtmlUnit offers a rich API for parsing web pages and for finding and evaluating elements on the page.
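To give an idea of that style (based on the HtmlUnit 2.x API, with a placeholder URL), fetching a page and listing its links looks roughly like:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // HtmlUnit parses the page and keeps cookies/headers on the WebClient for you
        HtmlPage page = webClient.getPage("http://example.com/");
        for (HtmlAnchor anchor : page.getAnchors()) {
            System.out.println(anchor.getHrefAttribute());
        }
        webClient.closeAllWindows();
    }
}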
A simpler solution would be to use HttpClient directly (which HtmlUnit uses under the hood). This would simply download the entire page and return it as an InputStream or String. You can then use regular expressions to find links etc., probably more like what you are doing currently with cURL.
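A minimal sketch of that approach, assuming a placeholder URL and a deliberately crude href regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class PageDownload {
    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();
        HttpResponse response = client.execute(new HttpGet("http://example.com/"));

        // read the whole body into a String
        String html = EntityUtils.toString(response.getEntity());

        // crude link extraction, much like grepping curl output
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}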
Try http://code.google.com/p/crawler4j/, a simple and efficient solution when you don't need JavaScript.
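For reference, a minimal crawler with it looks roughly like this (based on the crawler4j 4.x API; the domain filter, storage folder and seed URL are placeholders):

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        // only follow links inside the site being spidered (placeholder domain)
        return url.getURL().startsWith("http://example.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData data = (HtmlParseData) page.getParseData();
            String html = data.getHtml(); // standardize the content here
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j"); // placeholder folder
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://example.com/");
        controller.start(MyCrawler.class, 5); // 5 crawler threads
    }
}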