Detecting and editing external links

I'm aware of similar questions on SO, but since my situation is slightly different I thought it would be better to open a new question. I searched for an hour; if I missed something, please forgive me.

The problem: I'm developing a feature similar to Facebook's. A user can post a text message which may contain a number of links. These may or may not be wrapped in anchor tags, and may use different protocols (http, https, ftp, ...).

I need to

  1. Detect these links and perhaps attempt to retrieve them (just like Facebook). I guess this is a task for jQuery?

  2. Reliably detect the external links and change them to mysite.com/external?url=thelink. This, I believe, is a task for PHP (since I can't trust input coming from the client side, right?)

Anyhow, since the links are not guaranteed to be in anchor tags, a DOM parser doesn't seem very reliable (or am I wrong?). I found a simple regex on the web (I'm terrible with regex, by the way) which I think I can make use of by adding more protocols:

$strText = preg_replace( '/(http|ftp)+(s)?:(\/\/)((\w|\.)+)(\/)?(\S+)?/i', '<a href="\0">\4</a>', $strText );  

Can anyone with experience in this task please point me in the right direction?

Yup, this is definitely something you want to do server-side. First off, if you're accepting user input containing HTML markup, you should be sanitizing it with a good HTML filter like HTML Purifier. (This will also make their input easier to parse for more complex markup.)

This should be doable within a single preg_replace() statement, but I'd split it into something like this:

$hrefPattern = '/<a[^>]+?href="(.+?)".*?>/i';

$outLink = 'http://mysite.com/external?url=';

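$offset = 0;
while (preg_match($hrefPattern, $text, $hrefMatches, PREG_OFFSET_CAPTURE, $offset))
{

    $hrefInner = $hrefMatches[1][0]; // the URL inside href="..."
    $offset = $hrefMatches[1][1];    // byte offset of that URL within $text
    echo $hrefInner . "\n";          // debug output

    if (strpos($hrefInner, '://') !== false)
    {
        $externalUrl = $outLink . rawurlencode($hrefInner);
        // Replace only this occurrence. str_replace() would rewrite every copy of
        // the URL in $text, so a repeated link would later be wrapped a second time.
        $text = substr_replace($text, $externalUrl, $offset, strlen($hrefInner));
        $offset += strlen($externalUrl);
    }

}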

The preg_match() documentation explains the PREG_OFFSET_CAPTURE flag and the offset parameter pretty well. We're basically just looking up each <a ... href=""> tag, grabbing its contents, reformatting it if it contains ://, and repeating until there are no more links left in $text. Note that this check also matches absolute links to your own site, so you may want to compare the host as well. If you reformat the link, you need to rawurlencode() the link you scraped to make sure the new link is valid.
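For comparison, here is a minimal sketch of the single-statement variant using preg_replace_callback(). It assumes the input has already been sanitized so that href values are double-quoted. The function name rewriteExternalLinks() and its $ownHost parameter are made up for illustration, and the host comparison is only one possible way to decide what counts as "external":

```php
<?php
// Hypothetical helper: the name and the $ownHost/$outLink parameters are assumptions.
function rewriteExternalLinks(string $html, string $ownHost, string $outLink): string
{
    // Capture the href value of each <a> tag (double-quoted attributes only).
    $pattern = '/(<a\s+(?:[^>]*?\s)?href=")([^"]*)(")/i';

    $result = preg_replace_callback($pattern, function (array $m) use ($ownHost, $outLink): string {
        // Sanitized HTML encodes "&" as "&amp;", so decode before inspecting the URL.
        $url  = html_entity_decode($m[2], ENT_QUOTES | ENT_HTML5, 'UTF-8');
        $host = parse_url($url, PHP_URL_HOST);

        // Relative links have no host; links to our own host stay untouched.
        if (!is_string($host) || strcasecmp($host, $ownHost) === 0) {
            return $m[0];
        }

        // Encode the target for the query string, then escape it for the attribute.
        $wrapped = htmlspecialchars($outLink . rawurlencode($url), ENT_QUOTES, 'UTF-8');
        return $m[1] . $wrapped . $m[3];
    }, $html);

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$text = '<a href="http://example.com/a?b=1">one</a> <a href="/about">two</a> '
      . '<a href="http://mysite.com/home">three</a>';
echo rewriteExternalLinks($text, 'mysite.com', 'http://mysite.com/external?url='), "\n";
```

Only the first link in the usage example should end up wrapped; the relative link and the link to mysite.com are left alone. Because each match is rewritten exactly once, there is no offset bookkeeping and a repeated URL cannot be wrapped twice.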

The way Facebook scrapes content for its link snippets is, I'd imagine, a lot more complex than that, but yes: you'd want to send an AJAX request to a PHP page that scrapes the link in question and generates whatever snippet you want. There's quite a bit more involved in that, though. You'll have to handle pages that do not exist, redirects to other pages, invalid markup, different document types, and so on.
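As a rough sketch of the server-side half of that AJAX request, something like the following could fetch a page and pull out a title and description. It assumes the cURL and DOM extensions are available. The function names fetchPage() and extractSnippet(), the timeout values, and the shape of the returned array are all illustrative choices, not how Facebook does it:

```php
<?php
// Hypothetical helpers: names, limits, and the returned array shape are assumptions.

/** Fetch a URL, following redirects; returns the body, or null on any failure. */
function fetchPage(string $url): ?string
{
    // Only fetch http(s) URLs; a user-supplied URL should not reach file:// and the like.
    if (!preg_match('#^https?://#i', $url)) {
        return null;
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects...
        CURLOPT_MAXREDIRS      => 5,    // ...but only a few of them
        CURLOPT_TIMEOUT        => 5,    // give up after 5 seconds
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat missing pages and server errors the same as network failures.
    return ($body !== false && $status >= 200 && $status < 300) ? $body : null;
}

/** Pull a title and meta description out of possibly invalid HTML. */
function extractSnippet(string $html): array
{
    $snippet = ['title' => '', 'description' => ''];
    if (trim($html) === '') {
        return $snippet;
    }

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect markup errors quietly
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $snippet['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strcasecmp($meta->getAttribute('name'), 'description') === 0) {
            $snippet['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }
    return $snippet;
}

// Usage with a literal string; a real endpoint would call fetchPage() first
// and send the result back with json_encode().
$html = '<html><head><title> Example </title>'
      . '<meta name="description" content="A demo page"></head><body><p>Hi</body></html>';
echo json_encode(extractSnippet($html)), "\n";
```

Non-HTML document types, character encodings, and requests aimed at hosts inside your own network are not handled here and would need extra checks.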

Hope that helps!