I may be saying this with incorrect terminology so correct me if I'm wrong please.
Here's what I want to do: I'm trying to scrape a website's comments section, but the comments are loaded via an AJAX call after the page has fully loaded. Right now I fetch the HTML from the site with:
res, err := http.Get(url)
if err != nil {
	// handle error
}
defer res.Body.Close()
But this obviously gets the HTML as it is before the AJAX call runs. How do I go about getting the HTML after the AJAX call?
This is completely off the top of my head, but would I need to basically create a js-renderer in code for this? My guess is that the JS needs to execute somehow. Any suggestions / libraries / examples on how to go about this? I'd prefer this to be in go but it could be realistically in any language.
If you own the site or can easily determine (or generate) the URI of the call that loads the comments, it's probably easier to make that same AJAX call yourself rather than bother with DOM parsing or arbitrary JS execution.
At that point Go would actually be a good language to use, since its JSON and XML standard libraries are excellent for unmarshalling that kind of data.
You can use a headless browser such as http://phantomjs.org/ to load the page, execute all the JavaScript on it, and scrape the comments. This example may help: https://github.com/ariya/phantomjs/blob/master/examples/phantomwebintro.js
However, PhantomJS is a separate binary application, so installing it may not be trivial.
You can also inspect the page using Firebug, see the requests being sent to fetch the comments, and emulate that call in Go.
Maybe the page loads comments via JavaScript code like this:
$.get( "/ajax/comments", function( data ) {
  $( ".comments" ).html( data );
});
so you can fetch and parse the /ajax/comments page using Go.
I recently had the same issue, and goquery helped a lot. I tried it on the first site I found online where comments are loaded by a JS event and wrote you a small snippet. Give it a try and see if it works for you.
doc, err := goquery.NewDocument("http://www.ihg.com/holidayinn/hotels/us/en/san-francisco/sfocc/hoteldetail/hotel-reviews?scmisc=hotel_details_reviews_link_bottom")
if err != nil {
	log.Fatal(err)
}
htmlContents, _ := doc.Html()
fmt.Println(htmlContents)
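This should print all the comments below the main content of the page, which are loaded by a JS event. Note that goquery itself does not execute JavaScript, so this only works when the comments are already in the HTML the server returns.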
Good Luck!