How to extract HTML markup from web pages generated at runtime

I am using the SimpleHTMLDOM parser to extract HTML data from web pages. However, I have come across websites, such as www.coursera.com, where the page is generated at runtime.

Has anyone tried parsing such pages?

I am new to this field, so some theory on the topic would help me understand how to parse such pages.

In this case it's probably easier (though not always). The data used to generate the content most likely arrives through AJAX requests, so you can send a request to those AJAX endpoints directly and parse the response yourself.

Often this will be in JSON, which is quite easy to parse compared to HTML.
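As a rough sketch of that idea (in Python for brevity; the same approach should carry over to PHP with `json_decode`): the endpoint URL and the `{"elements": [{"name": ...}]}` response shape below are made up for illustration. You would find the real endpoint and its actual structure by watching the requests in your browser's network tab.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only. Look up the real one
# in the browser's network tab while the page loads.
ENDPOINT = "https://example.com/api/courses"


def fetch_text(url, timeout=10.0):
    """Download the raw response body from an AJAX endpoint."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def extract_names(body):
    """Parse a JSON body and return the 'name' of each listed element.

    The {"elements": [{"name": ...}]} layout is an assumed example shape,
    not the format of any particular site.
    """
    data = json.loads(body)
    # Skip entries that have no "name" key instead of raising KeyError.
    return [item["name"] for item in data.get("elements", []) if "name" in item]
```

With a real endpoint you would call `extract_names(fetch_text(ENDPOINT))`. No HTML parser is involved at all, since the data never passes through markup.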

John Resig wrote an HTML Parser.

Demo: http://ejohn.org/blog/pure-javascript-html-parser/

This may work for you.