How can I replace each URL in a string with another unique URL?

I have the following:

$reg[0] = '`<a(\s[^>]*)href="([^"]*)"([^>]*)>`si';
$reg[1] = '`<a(\s[^>]*)href="([^"]*)"([^>]*)>`si';
$replace[0] = '<a$1href="http://www.yahoo.com"$3>';
$replace[1] = '<a$1href="http://www.live.com"$3>';
$string = 'Test <a href="http://www.google.com">Google!!</a>Test <a href="http://www.google.com">Google!!2</a>Test';
echo preg_replace($reg, $replace, $string);

Which results in:

Test <a href="http://www.live.com">Google!!</a>Test <a href="http://www.live.com">Google!!2</a>Test

I'm looking to end up with (the difference being in the first link):

Test <a href="http://www.yahoo.com">Google!!</a>Test <a href="http://www.live.com">Google!!2</a>Test

The idea is to replace the URL of each link in a string with a different, unique URL. It's for a newsletter system where I want to track what people have clicked on, so each URL will be a "fake" URL that records the click and then redirects the user to the real URL.

The problem is that preg_replace() applies the patterns in order, so the output of your first replacement is matched by the second search pattern, and the second replacement string overwrites the first.

Unless you can somehow differentiate "modified" links from the original ones so that they won't get caught by the other expression (perhaps by adding an extra HTML property?), I don't think you can really solve this with a single preg_replace() call. One possible solution (aside from the differentiation in the regular expression) that comes to mind would be to use preg_match_all(), since it will give you an array of matches to work with. You could probably then encode the matched URLs with your tracking URL by iterating over the array and running a str_replace() on each matched URL.
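
The preg_match_all() idea could be sketched roughly as follows. This is only one way to do it, under some assumptions: the tracking URL http://example.com/track.php?id= is made up, and the loop index stands in for whatever ID you would actually store. It also uses substr_replace() at the captured offsets rather than str_replace(), since str_replace() would rewrite every occurrence of a URL that appears more than once (as in the question's example).

```php
<?php
// Hypothetical tracking endpoint; the real URL scheme is up to your application.
$trackBase = 'http://example.com/track.php?id=';

$string = 'Test <a href="http://www.google.com">Google!!</a>Test <a href="http://www.google.com">Google!!2</a>Test';

// PREG_OFFSET_CAPTURE turns each capture into a [text, byteOffset] pair.
preg_match_all('`<a\s[^>]*href="([^"]*)"[^>]*>`si', $string, $matches, PREG_OFFSET_CAPTURE);

// Walk the matches from last to first so earlier offsets stay valid
// while the string changes length.
for ($i = count($matches[1]) - 1; $i >= 0; $i--) {
    list($url, $offset) = $matches[1][$i];
    // $i + 1 stands in for an ID you would store alongside $url.
    $string = substr_replace($string, $trackBase . ($i + 1), $offset, strlen($url));
}

echo $string;
```

Working backwards through the matches avoids having to recompute offsets after each replacement.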

I'm not good with regexps, but if what you're doing is just replacing external URLs (i.e. not part of your site/application) with an internal URL that will track click-thrus and redirect the user, then it should be easy to construct a regexp that will match only external URLs.

So let's say your domain is foo.com, then you just need to create a regexp that will only match a hyperlink that doesn't contain a URL starting with http://foo.com. Now, as I said, I'm pretty bad with regexps, but here's my best stab at it:

$reg[0] = '`<a(\s[^>]*)href="(?!http://foo.com)([^"]*)"([^>]*)>`si';

Edit: If you want to track click-thrus to internal URLs as well, then just replace http://foo.com with the URL of your redirect/tracking page, e.g. http://foo.com/out.php.

I'll walk through an example scenario just to show what I'm talking about. Let's say you have the below newsletter:

<h1>Newsletter Name</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec lobortis,
ligula <a href="http://bar.com">sed sollicitudin</a> dignissim, lacus dolor
suscipit sapien, <a href="http://foo.com">eget auctor</a> ipsum ligula
non tortor. Quisque sagittis sodales elit. Mauris dictum blandit lacus.
Mauris consequat <a href="http://last.fm">laoreet lacus</a>.</p>

For the purpose of this exercise, the search pattern will be:

// Only match links that don't begin with: http://foo.com/out.php
`<a(\s[^>]*)href="(?!http://foo.com/out\.php)([^"]*)"([^>]*)>`si

This regexp can be broken down into 3 parts:

  1. <a(\s[^>]*)href="
  2. (?!http://foo.com/out\.php)([^"]*)
  3. "([^>]*)>

On the first pass of the search, the script will examine:

<a href="http://bar.com">

This link satisfies all 3 components of the regexp, so the URL is stored in the database and is replaced with http://foo.com/out.php?id=1.

On the second pass of the search, the script will examine:

<a href="http://foo.com/out.php?id=1">

This link matches 1 and 3, but not 2. So the search will move on to the next link:

<a href="http://foo.com">

This link satisfies all 3 components of the regexp, so the URL is stored in the database and is replaced with http://foo.com/out.php?id=2.

On the 3rd pass of the search, the script will examine the first 2 (already replaced) links, skip them, and then find a match with the last link in the newsletter.
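
The passes described above might look something like this in code. Treat it as a sketch under assumptions: the $stored array is a stand-in for the database, the ID is simply the number of URLs stored so far, and the newsletter is shortened to its three links. The dots in the lookahead are escaped here, which the pattern above does not do.

```php
<?php
// Hypothetical redirect page; an in-memory array replaces the database.
$trackUrl = 'http://foo.com/out.php';
$stored = array();

$newsletter = '<p><a href="http://bar.com">one</a> <a href="http://foo.com">two</a> '
            . '<a href="http://last.fm">three</a></p>';

// Skip links that already point at the tracking page.
$pattern = '`<a(\s[^>]*)href="(?!http://foo\.com/out\.php)([^"]*)"([^>]*)>`si';

// Each pass finds the first link that has not been rewritten yet.
while (preg_match($pattern, $newsletter, $m)) {
    $stored[] = $m[2];      // stand-in for storing the URL in the database
    $id = count($stored);   // stand-in for the inserted row's ID
    $replacement = '<a$1href="' . $trackUrl . '?id=' . $id . '"$3>';
    // The limit of 1 rewrites only the link that was just recorded.
    $newsletter = preg_replace($pattern, $replacement, $newsletter, 1);
}

echo $newsletter;
```

The loop ends on its own once every remaining link starts with the tracking URL, because the negative lookahead then rejects all of them.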

I'm not sure I've understood the question correctly, but I wrote the following snippet. The regex matches hyperlinks. The script then loops through the results and compares each link's text against the hyperlink references. When a link's text is found inside a hyperlink reference, it extends the matches with a sample trackback link that contains a unique key.

UPDATE: The snippet finds all hyperlinks:

  1. find links
  2. build trackback link
  3. find position of each found link (matches[3]) and set a template tag
  4. replace template tags with trackback links

Each link position is unique.

$string = '<h1>Newsletter Name</h1> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec lobortis, ligula <a href="http://bar.com">sed sollicitudin</a> dignissim, lacus dolor suscipit sapien, <a href="http://foo.com">bar.com</a> ipsum ligula non tortor. Quisque sagittis sodales elit. Mauris dictum blandit lacus. Mauris consequat <a href="http://last.fm">laoreet lacus</a>.</p> <h1>Newsletter Name</h1> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec lobortis, ligula <a href="http://bar.com">sed sollicitudin</a> dignissim, lacus dolor suscipit sapien, <a href="http://foo.com">bar.com</a> ipsum ligula non tortor. Quisque sagittis sodales elit. Mauris dictum blandit lacus. Mauris consequat <a href="http://last.fm">laoreet lacus</a>.</p> <h1>Newsletter Name</h1> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec lobortis, ligula <a href="http://bar.com">sed sollicitudin</a> dignissim, lacus dolor suscipit sapien, <a href="http://foo.com">bar.com</a> ipsum ligula non tortor. Quisque sagittis sodales elit. Mauris dictum blandit lacus. Mauris consequat <a href="http://last.fm">laoreet lacus</a>.</p> ';

// Capture each link's href in $matches[1] and its text in $matches[2] (U = ungreedy).
preg_match_all("'<a\s+href=\"(.*)\"\s*>(.*)<\/[^>]+>'U",$string,$matches);


$uniqueURL = 'http://www.yourdomain.com/trackback.php?id=';

foreach($matches[2] as $k2 => $m2){
    foreach($matches[1] as $k1 => $m1){
        if(stristr($m1, $m2)){
                $uniq = $uniqueURL.md5($matches[0][$k2])."_".rand(1000,9999);
                $matches[3][$k1] = $uniq."&refLink=".$m1;
        }
    }
}


foreach($matches[3] as $key => $val) {

    $startAt = strpos($string, $matches[1][$key]);
    $endAt= $startAt + strlen($matches[1][$key]);

    $strBefore = substr($string,0, $startAt);
    $strAfter = substr($string,$endAt);

    $string = $strBefore . "@@@$key@@@" .$strAfter;

}
foreach($matches[3] as $key => $val) {
        $string = str_replace("@@@$key@@@",$matches[3][$key] ,$string);
}
print "<pre>";
echo $string;

Before PHP 5.3, which lets you create a function on the spot, you have to use either create_function (which I hate) or a helper class.

/**
 * For retrieving a new string from a list.
 */
class StringRotation {
    var $i = -1;
    var $strings = array();

    function addString($string) {
        $this->strings[] = $string;
    }

    /**
     * Use sprintf to produce result string
     * Rotates forward
     * @param array $params the string params to insert
     * @return string
     * @uses StringRotation::getNext()
     */
    function parseString($params) {
        $string = $this->getNext();
        array_unshift($params, $string);
        return call_user_func_array('sprintf', $params);
    }

    function getNext() {
        $this->i++;
        $t = count($this->strings);
        // Wrap around once the pointer moves past the last index ($t - 1).
        if ($this->i >= $t) {
            $this->i = 0;
        }
        return $this->strings[$this->i];
    }

    function resetPointer() {
        $this->i = -1;
    }
}

$reg = '`<a(\s[^>]*)href="([^"]*)"([^>]*)>`si';
$replaceLinks[0] = '<a%2$shref="http://www.yahoo.com"%4$s>';
$replaceLinks[1] = '<a%2$shref="http://www.live.com"%4$s>';

$string = 'Test <a href="http://www.google.com">Google!!</a>Test <a href="http://www.google.com">Google!!2</a>Test';

$linkReplace = new StringRotation();
foreach ($replaceLinks as $replaceLink) {
    $linkReplace->addString($replaceLink);
}

echo preg_replace_callback($reg, array($linkReplace, 'parseString'), $string);
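
On PHP 5.3 or later, the same rotation could be sketched with an anonymous function instead of the helper class. This is a minimal illustration rather than a drop-in replacement: the $urls array and the counter $i are names made up for this example, and the replacement tag is built by concatenation rather than sprintf.

```php
<?php
$reg = '`<a(\s[^>]*)href="([^"]*)"([^>]*)>`si';
$urls = array('http://www.yahoo.com', 'http://www.live.com');
$string = 'Test <a href="http://www.google.com">Google!!</a>Test <a href="http://www.google.com">Google!!2</a>Test';

$i = 0;
$result = preg_replace_callback($reg, function ($m) use (&$i, $urls) {
    // Cycle through the replacement URLs, wrapping around at the end of the list.
    $url = $urls[$i % count($urls)];
    $i++;
    // $m[1] and $m[3] hold the attributes before and after href.
    return '<a' . $m[1] . 'href="' . $url . '"' . $m[3] . '>';
}, $string);

echo $result;
```

Capturing $i by reference lets the closure keep its position between matches, which is the job the class's pointer does above.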