Scraping multiple websites in PHP and saving content under common page names [closed]

I have a requirement to scrape multiple websites (around 100 different sites) and save their main page contents in a database. The problem is that these websites don't all share the same structure or the same link texts. For example, one site may have an "About Us" page while another site's equivalent content sits under a page called "Who we are". That makes it difficult to identify equivalent content and list it in one database column. Likewise, traversing the inner pages of 100 or more websites, scraping each page, and putting the data into common columns becomes even more problematic. How can I resolve this? I would appreciate any ideas that can help me do this. I am using PHP and cURL to develop the solution.

A clearer example is below.

Site 1 links - Home / About Us / Products / Contact Us

Site 2 links - Home / Who we are / Services / FAQ / Contact

Site 3 links - Home / What we do / Our Company / Contact Us

Site 4 links - Home / Register / Shop / Where we are

Now I want the above links to be auto-categorized as follows:

About Us column - About Us / Who we are / Our Company

Contact Us column - Contact Us / Contact / Where we are

Products column - Products / What we do

P.S. I prefer to hear about methods rather than a coding example.

Yeah. You may have to build a bot to do it in C, C# or C++, and hard-code a list of the different labels that can mean the same thing.

It could look like this:

switch (possiblenames)
{
    case "About Us":
        // Rest of code
        break;
    case "Who are we?":
        // Rest of code
        break;
}
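Since the question mentions PHP, the same idea can be sketched there as a lookup table instead of a switch. This is only a starting point; the labels in the map are assumptions you would extend, and $linkText is a placeholder for whatever anchor text you scraped:

<?php
// Map the different labels that can mean the same thing onto one canonical name.
$labelMap = [
    'about us'    => 'about',
    'who are we?' => 'about',
    'who we are'  => 'about',
    'our company' => 'about',
];

// Unknown labels come back as null so you can review them later.
$canonical = $labelMap[strtolower($linkText)] ?? null;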

Just use curl or wget on your list of URLs and store the whole page data in your database. But then, if you also want to display a page from the stored data, you will have to store the assets the page depends on (CSS, JS, images, ...) as well, just as a web browser does when you use "Save page as ...".
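For the fetching step, a minimal PHP/cURL sketch could look like the following; the pages table, its columns, and the connection details are hypothetical, and error handling is kept to a minimum:

<?php
// Fetch one URL with cURL and return the raw HTML, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Hypothetical storage: a "pages" table with url and raw_html columns.
$pdo  = new PDO('mysql:host=localhost;dbname=scraper', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO pages (url, raw_html) VALUES (?, ?)');

foreach ($urls as $url) {        // $urls is your list of ~100 site URLs
    $html = fetchPage($url);
    if ($html !== null) {
        $stmt->execute([$url, $html]);
    }
}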

Instead of scouring the page content for "About", "About Us", "Who We Are", etc., do that with the page links themselves: <a href="about">. The URLs will most likely be more standardized than the visible link text ("About Us").
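For example, you could pull the href values (and anchor texts) out of a fetched page with PHP's DOMDocument; a sketch, assuming $html holds the raw HTML from the previous step:

<?php
// Extract all link hrefs and their anchor texts from an HTML page.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from real-world malformed HTML
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[$href] = trim($a->textContent);
        }
    }
    return $links;
}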

Build up a list of keywords, and then filter the links/URLs through those keywords; this should help you more easily put them into the proper categories. If something doesn't match your keywords, put it in a "to be edited" list. Look at that content, figure out where it belongs, and then add the new label to your keyword list.
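A rough PHP sketch of that keyword filter, with an unmatched bucket for the "to be edited" list; the category names and keyword lists are only illustrative starting points:

<?php
// Canonical categories mapped to normalized keywords.
$keywords = [
    'about'    => ['about', 'whoweare', 'ourcompany'],
    'contact'  => ['contact', 'whereweare'],
    'products' => ['product', 'service', 'shop', 'whatwedo'],
];

// Return the category for a link, or null if nothing matches.
function categorize(string $href, string $text, array $keywords): ?string
{
    // Normalize: lowercase and strip separators, so "Who we are" and
    // "/who-we-are" both become "whoweare".
    $haystack = preg_replace('/[^a-z0-9]/', '', strtolower($href . ' ' . $text));
    foreach ($keywords as $category => $terms) {
        foreach ($terms as $term) {
            if (strpos($haystack, $term) !== false) {
                return $category;
            }
        }
    }
    return null;
}

$toBeEdited = [];
foreach ($links as $href => $text) {    // $links comes from the extraction step above
    $category = categorize($href, $text, $keywords);
    if ($category === null) {
        $toBeEdited[$href] = $text;     // review manually, then grow $keywords
    } else {
        // store the scraped content under the matching database column
    }
}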