Webcrawler: extracting link elements

I want to extract elements from a webpage.

$html = file_get_contents($link);

That function returns the complete HTML document, but I only want the internal and external links so that I can save them in the database.

$sql = "INSERT INTO prueba (link, title, description) VALUES (?, ?, ?)";

// prepare the statement
$query = $pdo->prepare($sql);

// execute the statement
$result = $query->execute([
  $link,
  $title_out,
  $description
]);

I can already extract the title and the description and store them in the database, but I also want to extract all the internal and external links: the internal links in one column and the external links in another. Both columns already exist in the table.

I suggest using a DOM-Parser library like:

Parse the HTML and just "query" for all anchors (a tags).

This is much less error-prone than trying to extract them yourself, for example with regexes.
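As a rough illustration of "parse the HTML and query for all anchors", here is a minimal sketch using PHP's bundled DOM extension rather than a third-party library. The helper name extractLinks() is made up for this example, and it assumes the HTML has already been fetched (for instance with file_get_contents as in the question).

```php
<?php
// Sketch only: extractLinks() is a hypothetical helper, not a library function.
// It returns the unique, non-empty href values of all <a> elements in $html.
function extractLinks(string $html): array
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return [];
    }

    $dom = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        // getAttribute() returns '' when the attribute is missing.
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}
```

With the question's variables this would be called as `$links = extractLinks($html);`.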

HTML scraping

For that I advise you to use open-source libraries that provide helper functions for navigating the DOM. Without one, you'll have to maintain much more code. If you scrape multiple pages with regexes, you'll have to update your regex queries every time a page changes.

You don't want that ^^'

Here is an example using the "Goutte" library (I hope you are on PHP 5.5+):

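// $crawler is assumed to come from a Goutte client, e.g. (new \Goutte\Client())->request('GET', $link)
$links = [];
// Capture $links by reference, otherwise the closure cannot append to the outer array
$crawler->filter('a')->each(function ($node) use (&$links) {
    var_dump($node->attr('href')); // debug output, safe to remove
    $links[] = $node->attr('href');
});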

$links now contains the href attribute of every anchor on the page.

For more examples of node traversal, please see this link

Use your database logic to persist this data
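Since the question asks for internal and external links in separate columns, one possible way to split the collected hrefs is to compare each link's host with the host of the crawled page. This is only a sketch under a few assumptions: splitLinks() is a hypothetical helper, relative URLs count as internal, subdomains such as www. are treated as different hosts, and the column names in the usage comment are invented.

```php
<?php
// Sketch only: splitLinks() is a hypothetical helper, not part of Goutte or PHP.
// It groups hrefs into 'internal' and 'external' relative to $pageUrl.
function splitLinks(array $links, string $pageUrl): array
{
    $pageHost = strtolower((string) parse_url($pageUrl, PHP_URL_HOST));
    $internal = [];
    $external = [];

    foreach ($links as $href) {
        $href = trim((string) $href);

        // Skip empty values and pure fragment links like "#top".
        if ($href === '' || $href[0] === '#') {
            continue;
        }

        // Skip non-HTTP schemes such as mailto: or javascript:.
        $scheme = parse_url($href, PHP_URL_SCHEME);
        if (is_string($scheme) && !in_array(strtolower($scheme), ['http', 'https'], true)) {
            continue;
        }

        // No host means a relative URL, which is treated as internal here.
        $host = parse_url($href, PHP_URL_HOST);
        if (!is_string($host) || strtolower($host) === $pageHost) {
            $internal[] = $href;
        } else {
            $external[] = $href;
        }
    }

    return ['internal' => $internal, 'external' => $external];
}

// Usage sketch; internal_links and external_links are assumed column names,
// and storing each group as JSON is just one option:
// $groups = splitLinks($links, $link);
// $query = $pdo->prepare('UPDATE prueba SET internal_links = ?, external_links = ? WHERE link = ?');
// $query->execute([json_encode($groups['internal']), json_encode($groups['external']), $link]);
```

If many links per page need to be queried individually later, a separate table with one row per link may be a better fit than packing them into two columns.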

Sorry if there is an error in the Goutte code; I don't use it often.