Regular expression to capture all relative and absolute links from an HTML string

I need to catch all links from multiple websites. For that I have fetched the entire HTML of each page. I need a regular expression that puts all of the links in an array.

I don't want to collect any image files or other code files, just the links to the HTML pages themselves.


I want it to collect all links like this:

/https://www.hello.com
/https://www.hello.com/index.php
/https://www.hello.com/world
/https://www.hello.com/world.php
/https://www.hello.com/world.html
/https://hello.com
/https://hello.com/world
/http://www.hello.com
/http://www.hello.com/world
/http://hello.com
/http://hello.com/world
/www.hello.com
/www.hello.com/world
/hello.com
/hello.com/world
/hello
/hello/world

But not like this:

hello 
hello/world
hello.png
hello.zip
/hello/world.png
/hello/world.js

What regular expression would I need for this? Or is there a better way (maybe by collecting the <a> elements)?

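I assume you define "link" as a hyperlink of the form <a href="...">. The following regex (already written as a PHP string, including the ~ delimiters that the preg_* functions require) should be a good start*. It requires whitespace after the tag name so that tags such as <area> or <abbr> are not matched:

'~<a\\s[^>]*href\\s*=\\s*"([^"]+)"~i'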


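When you use this with preg_match($regex, $html, $match), $match[1] gives you the link. Note that preg_match stops at the first match; to collect every link into an array, use preg_match_all, which puts all captured values into its own result's index 1. The captured value is still HTML-encoded (it might contain HTML entities such as &amp;). To decode those, use html_entity_decode.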

$link = html_entity_decode($match[1]);

You should also exclude links that are only fragments pointing into the same page, i.e. links that start with the hash symbol: $link[0] == '#'
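
Putting the steps above together, a minimal sketch might look like the following. The function name extract_links is hypothetical (it is not part of the original answer), and the pattern is written with its own ~ delimiters so that the block is self-contained. It only handles double-quoted href attributes.

```php
<?php
// Hypothetical helper: collects the href values of <a> tags whose attribute
// is double-quoted, decodes HTML entities, and drops pure fragment links.
function extract_links(string $html): array
{
    // "~" is the pattern delimiter; the "i" modifier ignores letter case.
    $regex = '~<a\\s[^>]*href\\s*=\\s*"([^"]+)"~i';

    // preg_match_all stores every value of capture group 1 in $matches[1].
    preg_match_all($regex, $html, $matches);

    $links = [];
    foreach ($matches[1] as $raw) {
        $link = html_entity_decode($raw);
        // Skip links that only point to a fragment of the same page.
        if ($link[0] === '#') {
            continue;
        }
        $links[] = $link;
    }
    return $links;
}
```

Calling extract_links($html) on a fetched page then returns a plain array of link strings, which can be filtered further (for example by file extension).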


*This regex does not conform to the HTML specification (I think it is impossible to do that 100% correctly with a regex). For example, it fails for links whose attribute value is not wrapped in double quotes (the value might be unquoted or wrapped in single quotes).
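
If single-quoted and unquoted values matter, one possible extension is a pattern with three alternatives, one per quoting style. This is only a sketch under that assumption: the function name extract_hrefs_any_quoting is hypothetical, and the pattern is still an approximation of HTML (for instance, it can be fooled by the text "href" appearing inside another attribute).

```php
<?php
// Hypothetical helper: like a plain href regex, but accepts "double-quoted",
// 'single-quoted' and unquoted attribute values.
function extract_hrefs_any_quoting(string $html): array
{
    // Group 1: double-quoted, group 2: single-quoted, group 3: unquoted.
    $regex = '~<a\\s[^>]*?href\\s*=\\s*(?:"([^"]*)"|\'([^\']*)\'|([^\\s"\'>]+))~i';

    // PREG_SET_ORDER yields one array per match instead of one per group.
    preg_match_all($regex, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Only one of the three groups can be filled; groups that did not
        // take part are empty or (if trailing) missing from the array.
        $raw = $m[1] ?? '';
        if ($raw === '') {
            $raw = $m[2] ?? '';
        }
        if ($raw === '') {
            $raw = $m[3] ?? '';
        }
        if ($raw !== '') {
            $links[] = html_entity_decode($raw, ENT_QUOTES);
        }
    }
    return $links;
}
```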

Something like PHPQuery may be preferable to using a regex in this case. See this answer for an explanation of why.
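
As an illustration of the parser-based route, here is a sketch that uses PHP's bundled DOMDocument class rather than phpQuery (a deliberate substitution, chosen only because it needs no extra library). The function name extract_page_links and the list of excluded extensions are assumptions made for this example; adjust the list to whatever counts as a "non-page" file for you.

```php
<?php
// Hypothetical helper: reads href values from <a> elements with a real HTML
// parser and skips fragments and links to obvious asset files.
function extract_page_links(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Assumed blacklist of extensions that are not HTML pages.
    $skipExtensions = ['png', 'jpg', 'jpeg', 'gif', 'svg', 'css', 'js', 'zip', 'pdf'];

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        // The parser has already decoded entities such as &amp;.
        $href = trim($a->getAttribute('href'));
        if ($href === '' || $href[0] === '#') {
            continue; // no href, or a fragment of the same page
        }
        // Look at the extension of the path part only (ignores ?query).
        $path = (string) parse_url($href, PHP_URL_PATH);
        $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        if (!in_array($ext, $skipExtensions, true)) {
            $links[] = $href;
        }
    }
    return $links;
}
```

Because the parser deals with quoting and entity decoding itself, the only logic left in the function is the filtering that the question asks for.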