I'm pretty new to Node.js.
I'm trying to download some HTML from a website in order to parse it and print some information for debugging.
I got this working with the http module (see this post), but when I print each chunk this way:
    var req = http.request(options, function(res) {
        res.setEncoding("utf8");
        res.on("data", function (chunk) {
            console.log(chunk);
        });
    });
I don't get the HTML that is loaded dynamically, for instance with Ajax:
<div class="container">
::before
<div class="row">
::before
....
</div>
Is there any other module that can help me with this?
Thanks!
I would like to share my success (thanks to @oKonyk).
Note that if you're running your script locally, you need to set this option:
options = { 'web-security': 'no' };
phantom.create({parameters: options}, function() {});
To capture dynamically built pages, you have to render them in a browser. There are several options for doing that with Node.js.
I would suggest using PhantomJS, which is a so-called headless browser.
As a proof of concept, you can install it globally with npm install phantomjs -g
and then create a test script 'google.js' with the following content:
    var page = require('webpage').create();
    console.log('The default user agent is ' + page.settings.userAgent);
    page.settings.userAgent = 'SpecialAgent';
    page.open('http://www.google.org', function(status) {
        if (status !== 'success') {
            console.log('Unable to access network');
        } else {
            var html = page.evaluate(function() {
                return document.getElementsByTagName('html')[0].innerHTML;
            });
            console.log(html);
        }
        phantom.exit();
    });
Then run it as phantomjs google.js
This prints the whole DOM of the page (at least everything within the <html>
tags), which is different from the raw response that you get with the http
module.
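For comparison, here is a rough sketch of what the http approach gives you. The "data" event fires once per chunk, so to see the whole raw response you have to collect the chunks until "end". The helper name collectBody and the simulated response fakeRes are made up for this illustration (they are not part of the http module); with a real request you would pass in the res object from the callback. Either way, the result is only the markup the server sent, without anything scripts add later.

```javascript
var EventEmitter = require("events");

// Hypothetical helper: gather every "data" chunk from a response-like
// emitter and pass the joined body to `done` when "end" fires.
function collectBody(res, done) {
    var parts = [];
    res.on("data", function (chunk) {
        parts.push(chunk);
    });
    res.on("end", function () {
        done(parts.join(""));
    });
}

// Simulated response standing in for the `res` object of http.request,
// so the sketch runs without network access.
var fakeRes = new EventEmitter();
var body = null;

collectBody(fakeRes, function (html) {
    body = html;
});

// The server's markup arrives in pieces, exactly as it was sent.
fakeRes.emit("data", '<div class="container">');
fakeRes.emit("data", '<div class="row"></div>');
fakeRes.emit("data", "</div>");
fakeRes.emit("end");

console.log(body);
```

The joined string is still the static server response. Content injected by Ajax after load only shows up when something actually executes the page's scripts, which is what the headless browser does.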
Later you can use phantom within your Node.js project (more info here).