I use simple_html_dom.php to get all the href values from web pages. This is my code:
<?php
include_once('simple_html_dom.php');
$url = $_GET['url']; // this is the target website address (for example, http://127.0.0.1/mysite/default.php?url=https://www.google.com)
if ($url) {
    $html = file_get_html($url);
    foreach ($html->find('a') as $e) {
        echo $e->href . '<br>';
    }
}
?>
But the problem is the output. It looks like this: /about, /domains, etc., or //en.wikipedia.org, //ro.wikipedia.org, and so on.
How can I convert these values to standard absolute URLs, for example http://www.example.com/about or https://www.example.com/page?
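/**
 * @param string      $href     URL to convert
 * @param string|null $base_url Remote server's host, like wikipedia.org (without http:// or https://)
 * @return string Absolute URL
 */
function convert_url($href, $base_url = NULL){
    $parse = parse_url($href);
    if ($parse === false) {
        return $href; // parse_url() returns false on seriously malformed URLs
    }
    $host = array_key_exists('host', $parse) ? $parse['host'] : $base_url;
    $path = array_key_exists('path', $parse) ? $parse['path'] : '/';
    if (strpos($path, '/') !== 0) {
        $path = '/' . $path; // "about" becomes "/about" (resolved against the site root)
    }
    $queryStr = array_key_exists('query', $parse) ? '?'.$parse['query'] : '';
    // Links without a scheme (e.g. //en.wikipedia.org) fall back to http
    $scheme = array_key_exists('scheme', $parse) ? $parse['scheme'].'://' : 'http://';
    return $scheme.$host.$path.$queryStr;
}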
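If links such as ../page or non-HTTP links such as mailto: also need to be handled, the same idea can be extended to resolve each href against the full URL of the page it was found on. The following is only a minimal sketch under stated assumptions: resolve_href() is a hypothetical helper name (it is not part of simple_html_dom or PHP), it covers the common cases from the question rather than every rule of RFC 3986, and it leaves anything that already has a scheme untouched.

```php
<?php
/**
 * Resolve an href against the URL of the page it was found on.
 * resolve_href() is a hypothetical helper, not a library function.
 *
 * @param string $href     Value of the href attribute
 * @param string $page_url Absolute URL of the page containing the link
 * @return string Absolute URL, or $href unchanged if it cannot be resolved
 */
function resolve_href($href, $page_url)
{
    $base = parse_url($page_url);
    if ($base === false || empty($base['scheme']) || empty($base['host'])) {
        return $href; // page URL is not absolute; nothing to resolve against
    }
    $root = $base['scheme'] . '://' . $base['host']
          . (isset($base['port']) ? ':' . $base['port'] : '');
    $page_path = isset($base['path']) ? $base['path'] : '/';

    // Already has a scheme (http:, https:, mailto:, javascript:, ...): leave as is.
    if (preg_match('~^[a-z][a-z0-9+.\-]*:~i', $href)) {
        return $href;
    }
    // Protocol-relative (//en.wikipedia.org): reuse the page's scheme.
    if (strpos($href, '//') === 0) {
        return $base['scheme'] . ':' . $href;
    }
    // Empty or fragment-only: same document, keep the page's path and query.
    if ($href === '' || $href[0] === '#') {
        $query = isset($base['query']) ? '?' . $base['query'] : '';
        return $root . $page_path . $query . $href;
    }
    // Query-only: same path, new query string.
    if ($href[0] === '?') {
        return $root . $page_path . $href;
    }
    // Root-relative (/about): append to scheme and host.
    if ($href[0] === '/') {
        return $root . $href;
    }

    // Document-relative (about, ../page): resolve against the page's directory.
    $dir      = substr($page_path, 0, strrpos($page_path, '/') + 1);
    $cut      = strcspn($href, '?#');      // where the query or fragment starts
    $rel_path = substr($href, 0, $cut);
    $suffix   = (string) substr($href, $cut);

    $raw      = explode('/', $dir . $rel_path);
    $last     = end($raw);
    $segments = array();
    foreach ($raw as $seg) {
        if ($seg === '..') {
            array_pop($segments);          // go up one directory
        } elseif ($seg !== '.' && $seg !== '') {
            $segments[] = $seg;
        }
    }
    $path = '/' . implode('/', $segments);
    if (!empty($segments) && in_array($last, array('', '.', '..'), true)) {
        $path .= '/';                      // keep the trailing slash of a directory
    }
    return $root . $path . $suffix;
}

// Usage with the kinds of values from the question
// (the page URL below is an illustrative assumption):
$page = 'https://www.example.com/docs/index.php?x=1';
echo resolve_href('/about', $page), "\n";             // https://www.example.com/about
echo resolve_href('//en.wikipedia.org', $page), "\n"; // https://en.wikipedia.org
echo resolve_href('../page?id=2', $page), "\n";       // https://www.example.com/page?id=2
```

In the scraping loop, this would be called as resolve_href($e->href, $url), so that the scheme and host come from the page that was actually fetched instead of being hard-coded.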
Something like this:
include_once('simple_html_dom.php');
$url = isset($_GET['url']) ? $_GET['url'] : '';
$parsedUrl = parse_url($url);
if (!empty($parsedUrl['scheme']) && !empty($parsedUrl['host'])) {
    $html = file_get_html($url);
    if ($html) { // file_get_html() returns false when the page cannot be loaded
        foreach ($html->find('a') as $link) {
            $l = http_build_url($link->href, [
                'scheme' => $parsedUrl['scheme'],
                'host' => $parsedUrl['host']
            ]);
            echo $l . '<br>';
        }
    }
}
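See the documentation for http_build_url for more information. Note that it is not part of core PHP: it is provided by the PECL pecl_http extension (version 1.x), which has to be installed separately.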