I am pulling in RSS feeds and, for various reasons, using DOMXPath to convert all existing anchor tags to custom tags that look like this:
[webserviceLink]{$url}[/webserviceLink][webserviceLinkName]{$text}[/webserviceLinkName]
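For context, the anchor conversion described above might look roughly like the sketch below. The function name `anchorsToCustomTags`, the wrapper `div` with id `wrap`, and the `<?xml encoding="UTF-8">` prefix (a common way to hint UTF-8 to `loadHTML()`) are all assumptions made for illustration, not the code actually in use.

```php
<?php
// Hypothetical helper: replaces every <a href="..."> with the custom tag text.
function anchorsToCustomTags(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // feed markup is often not valid HTML
    // The wrapper div gives a known root whose children are serialized later.
    $dom->loadHTML('<?xml encoding="UTF-8"><div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // DOMXPath::query() returns a static list, so replacing nodes in the loop is safe.
    foreach ($xp->query('//a[@href]') as $a) {
        $url    = $a->getAttribute('href');
        $text   = $a->textContent;
        $custom = "[webserviceLink]{$url}[/webserviceLink]"
                . "[webserviceLinkName]{$text}[/webserviceLinkName]";
        $a->parentNode->replaceChild($dom->createTextNode($custom), $a);
    }

    // Serialize only the children of the wrapper, not the html/body scaffolding.
    $out = '';
    foreach ($xp->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo anchorsToCustomTags('<p>Read <a href="http://example.com/">this</a> now.</p>');
```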
This works great, but I'd also like to convert all plain-text (non-HTML) links to the same format, and I'm having some issues.
Here's my code for converting the text links:
$pattern = '(?xi)(?<![">])\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
$desc = preg_replace_callback("#$pattern#i", function($matches)
{
    $input = $matches[0];
    // Prepend a scheme when the match starts with "www." or a bare domain.
    $url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
    // Shorten long link text; strpos() returns false when no space is found.
    if (strlen($input) > 20 && strpos($input, " ") === false)
        $input = substr($input, 0, 18)."... ";
    return "[webserviceLink]{$url}[/webserviceLink][webserviceLinkName]{$input}[/webserviceLinkName]";
}, $desc);
I'm not sure how to write the negative lookbehind in this regex so that a link is skipped when it is already inside an existing HTML tag, such as an img, or inside my custom link tags above.
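One regex-only option, offered as a sketch rather than a definitive fix, is to avoid lookbehind entirely and use the PCRE backtracking verbs `(*SKIP)(*F)`: earlier alternatives consume whole HTML tags and whole custom link blocks, then force a failure so the engine resumes scanning after them. The function name `wrapBareLinks` and the shortened URL pattern are assumptions for readability. The sketch also assumes that real anchors have already been converted to the custom tags, and a stray `<` in plain text would cause everything up to the next `>` to be skipped.

```php
<?php
// Hypothetical helper: wraps bare URLs, leaving tags and existing custom links alone.
function wrapBareLinks(string $desc): string
{
    // Simplified URL pattern (an assumption); a fuller pattern could replace it.
    $url = '\bhttps?://[^\s<>\[\]]+';

    // Alternatives 1 and 2 match what must be left untouched, then (*SKIP)(*F)
    // discards that match and continues the search after it.
    $re = '~<[^>]*>(*SKIP)(*F)'
        . '|\[webserviceLink\].*?\[/webserviceLinkName\](*SKIP)(*F)'
        . '|' . $url . '~is';

    return preg_replace_callback($re, function (array $m): string {
        $link = $m[0];
        return "[webserviceLink]{$link}[/webserviceLink]"
             . "[webserviceLinkName]{$link}[/webserviceLinkName]";
    }, $desc);
}

$sample = 'Docs at http://a.example/x and logo <img src="http://b.example/l.png"> plus '
        . '[webserviceLink]http://c.example/[/webserviceLink][webserviceLinkName]c[/webserviceLinkName]';
echo wrapBareLinks($sample);
```

Only the first URL in the sample is wrapped; the one inside the img attribute and the one already inside custom tags are passed over.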
I was able to get this working with XPath, by selecting only the text nodes that are not inside an anchor and running the replacement on those:
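$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($desc, 'HTML-ENTITIES', 'UTF-8'));
$xp = new DOMXPath($dom);
// The pattern never changes, so define it once, outside the loop.
$pattern = '((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
// Visit only text nodes that are not inside an existing <a> element.
foreach ($xp->query('//text()[not(ancestor::a)]') as $node)
{
    // "|[&<>]" also catches stray markup characters so they can be escaped:
    // appendXML() below requires well-formed XML.
    $replaced = preg_replace_callback("#$pattern|[&<>]#i", function($matches)
    {
        // Group 1 is only set when the URL alternative matched.
        if (!isset($matches[1]))
            return htmlspecialchars($matches[0]);
        $input = $matches[0];
        $url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
        if (strlen($input) > 20 && strpos($input, " ") === false)
            $input = substr($input, 0, 18)."... ";
        return '<a href="'.htmlspecialchars($url).'">'.htmlspecialchars($input).'</a>';
    }, $node->nodeValue); // nodeValue, not wholeText, which also includes adjacent text nodes
    if ($replaced === null || $replaced === $node->nodeValue)
        continue; // nothing to change (or a regex error): keep the node as is
    $newNode = $dom->createDocumentFragment();
    $newNode->appendXML($replaced);
    $node->parentNode->replaceChild($newNode, $node);
}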
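Because appendXML() parses its argument as XML, any `&` or `<` in the text has to be escaped before it is passed in. A variant that sidesteps this is to build the replacement through the DOM API, so text and attribute values never need manual escaping. The sketch below is one way that could look. The helper name `linkifyTextNode` and the short URL pattern in the usage part are assumptions; the truncation is byte-based, mirroring the `substr()` call above.

```php
<?php
// Hypothetical helper: replaces bare URLs in one text node with <a> elements.
function linkifyTextNode(DOMText $node, string $regex): void
{
    $text = $node->nodeValue;
    if (!preg_match_all($regex, $text, $m, PREG_OFFSET_CAPTURE)) {
        return; // no URLs (or a regex error): leave the node untouched
    }
    $dom  = $node->ownerDocument;
    $frag = $dom->createDocumentFragment();
    $pos  = 0;
    foreach ($m[0] as [$match, $offset]) {
        // Plain text before the URL; createTextNode() needs no manual escaping.
        $frag->appendChild($dom->createTextNode(substr($text, $pos, $offset - $pos)));
        $href  = preg_match('!^https?://!i', $match) ? $match : "http://$match";
        $label = strlen($match) > 20 ? substr($match, 0, 18) . '... ' : $match;
        $a = $dom->createElement('a');
        $a->setAttribute('href', $href);
        $a->appendChild($dom->createTextNode($label));
        $frag->appendChild($a);
        $pos = $offset + strlen($match);
    }
    // Remaining text after the last URL.
    $frag->appendChild($dom->createTextNode(substr($text, $pos)));
    $node->parentNode->replaceChild($frag, $node);
}

// Usage with a simplified pattern (an assumption; the longer one works the same way).
$dom = new DOMDocument();
$dom->loadHTML('<p>Tom &amp; Jerry: see http://example.com/?a=1&amp;b=2 today</p>');
$xp = new DOMXPath($dom);
foreach ($xp->query('//text()[not(ancestor::a)]') as $textNode) {
    linkifyTextNode($textNode, '~\bhttps?://[^\s<>]+~i');
}
echo $dom->saveHTML();
```

The trade-off is a slightly longer loop in exchange for not having to reason about which characters are safe to hand to the XML parser.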