Regular expression to match most URLs needs improvement

I need a function that checks a string for URLs.

function linkcleaner($url) {
$regex="(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))";

if(preg_match($regex, $url, $matches)) {
echo $matches[0];
}
}

The regular expression is taken from John Gruber's blog, where he addressed the problem of creating a regex that matches all URLs. Unfortunately, I can't make it work. The problem seems to come from the double quote inside the regex, or from the other punctuation symbols at the end of the expression. Any help is appreciated. Thank you!

Apart from @tandu's answer, you also need delimiters around a regex in PHP.

The easiest fix is to start and end your pattern with a #, as that character does not appear in it:

$regex="#(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))#";
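To see the delimiter rule in isolation, here is a minimal sketch using a much shorter pattern than Gruber's. The helper name `first_match` and the sample text are made up for illustration, and the `@` is only there to silence the warning PHP raises for the malformed pattern.

```php
<?php
// Hypothetical helper: returns the first match of $pattern in $text, or null
// when nothing matches or the pattern is rejected.
function first_match(string $pattern, string $text): ?string {
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    $ok = @preg_match($pattern, $text, $m);
    return $ok === 1 ? $m[0] : null;
}

$text = 'visit www.example.com today';

// No delimiters: the leading "w" is alphanumeric, so PHP rejects the pattern.
var_dump(first_match('www\d{0,3}[.]\S+', $text));

// Same pattern wrapped in # ... #, with the i modifier after the closing #.
var_dump(first_match('#www\d{0,3}[.]\S+#i', $text));
```

The first call yields NULL and the second yields "www.example.com". Modifiers such as i can go after the closing delimiter instead of the inline (?i) used in the original pattern.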

You need to escape the " with a \

Jack Maney's comment...EPIC :D

On a more serious note, it does not work because you terminated the string literal right in the middle.

To include a double quote (") in a string, you need to escape it using a \

So the line will be:

$regex="/(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))/";

Notice I've escaped the single quote (') as well. That is only needed when you define the string between two single quotes. The forward slashes inside the pattern are escaped too, because / is the delimiter here and an unescaped / would end the pattern early.
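The quoting rules above are easy to check on a small fragment. The sketch below writes the same character class three ways; the variable names are illustrative only, and the nowdoc form is an alternative I'm suggesting rather than something from the answers above. It needs no escaping at all, which may be the least error-prone option for a pattern this dense.

```php
<?php
// Three ways to write the character class [^\s'"] as a PHP string.
$double = "[^\s'\"]";   // double-quoted: only the " needs a backslash
$single = '[^\s\'"]';   // single-quoted: only the ' needs a backslash
$nowdoc = <<<'RE'
[^\s'"]
RE;

var_dump($double === $single, $single === $nowdoc);

// In a double-quoted string \' is not an escape sequence, so the backslash
// is kept and the regex engine later reads \' as a literal quote.
var_dump(strlen("\'"));   // backslash + quote
var_dump(strlen('\''));   // quote only
```

All three strings come out identical. The extra backslash before ' in a double-quoted pattern is harmless, since PCRE treats \' as a literal quote.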

I am not sure how you guys read this regex, because it's a real pain to read or modify... ;)

Try this (it is not a one-liner, but it is easy to understand and modify if needed):

<?php
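// Building blocks: each is a plain pattern fragment without delimiters.
$re_proto = "(?:https?|ftp|gopher|irc|whateverprotoyoulike)://";
// Loose IPv4 octet: it also accepts values from 256 to 299.
$re_ipv4_segment = "[12]?[0-9]{1,2}";
$re_ipv4 = "(?:{$re_ipv4_segment}[.]){3}".$re_ipv4_segment;
// Host that may be a single label (e.g. localhost); only used after a protocol.
$re_hostname = "[a-z0-9_]+(?:[.-][a-z0-9_]+){0,}";
// Host with at least one dot; may appear without a protocol (e.g. www.example.com).
$re_hostname_fqdn = "[a-z0-9_](?:[a-z0-9_-]*[.][a-z0-9]+){1,}";
$re_host = "(?:{$re_ipv4}|{$re_hostname})";
$re_host_fqdn = "(?:{$re_ipv4}|{$re_hostname_fqdn})";
$re_port = ":[0-9]+";
$re_uri = "(?:/[a-z0-9_.%-]*){0,}";
$re_querystring = "[?][a-z0-9_.%&=-]*";
$re_anchor = "#[a-z0-9_.%-]*";
// Full URL: either protocol + host, or a bare dotted host, then the optional parts.
$re_url = "(?:(?:{$re_proto})(?:{$re_host})|{$re_host_fqdn})(?:{$re_port})?(?:{$re_uri})?(?:{$re_querystring})?(?:{$re_anchor})?";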

$text = <<<TEXT
http://www.example.com
http://www.example.com/some/path/to/file.php?f1=v1&f2=v2#foo
http://localhost.localdomain/
http://localhost/docs/???
www....wwhat?
www.example.com
ftp://ftp.mozilla.org/pub/firefox/latest/
Some new Mary-Kate Olsen pictures I found: the splendor of the Steiner Street Picture of href… http://t.co/tJ2NJjnf
TEXT;

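// "\01" is the control character 0x01, used as the delimiter because the pattern itself contains both / and #.
$count = preg_match_all("\01{$re_url}\01is", $text, $matches);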
var_dump($count);
var_dump($matches);
?>
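Another way to keep a pattern readable, in the same spirit as the snippet above, is PCRE's x modifier, which ignores whitespace in the pattern and allows # comments. This is only a sketch: the function name `find_urls`, the list of schemes, and the deliberately simplified grammar are my assumptions, and it is not a replacement for the fuller pattern above.

```php
<?php
// Hypothetical helper: returns every URL-like substring found in $text.
// The pattern is a simplified sketch, not a complete URL grammar.
function find_urls(string $text): array {
    // ~ is the delimiter, so it must not appear anywhere inside the pattern,
    // including the comments. i = case-insensitive, x = extended syntax.
    $pattern = '~
        (?: https?|ftp ) ://                  # scheme (assumed list)
        [a-z0-9_-]+ (?: [.] [a-z0-9_-]+ )*    # host labels separated by dots
        (?: : [0-9]+ )?                       # optional port
        (?: / [a-z0-9_.%/-]* )?               # optional path
    ~ix';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$sample = 'see http://www.example.com/a/b.php and ftp://ftp.mozilla.org/pub/ now';
print_r(find_urls($sample));
```

With x, whitespace inside a character class is still significant, so the classes are written without spaces.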