Regular expression for converting mailto links

I've been using this regular expression (probably found on Stack Overflow a few years back) to convert mailto links in PHP:

preg_match_all("/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>([^<|@]*)(@?)([^<]*)<\/a>/i",$content,$matches);

I pass it $content = '<a href="mailto:name@domain.com">somename@domain.com</a>'

It returns these matched pieces:

0 <a href="mailto:name@domain.com">somename@domain.com</a>
1  
2 "
3 name@domain.com
4 name
5 domain.com
6 "
7 
8 somename
9 @
10 domain.com

Example usage: <a href="send.php?user=$matches[4][0]&dom=$matches[5][0]">ucwords($matches[8][0])</a>

My problem is that some links contain nested tags. The pattern relies on "<" to delimit pieces 8, 9 and 10, so nested tags throw it off.

Example: <a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>

I need to ignore the nested tags and extract just the "somename" piece, so that the matched pieces look like this:

match part 8 = <span><b>
match part 9 = somename
match part 10 = @
match part 11 = domain.com
match part 12 = </b></span>

I've tried to get it to work by tweaking ([^<|@]*)(@?)([^<]*) but I can't figure out the right syntax to match or ignore the nested tags.

You could match everything between the opening and closing <a> tags with a lazy .*?. Replace ([^<|@]*)(@?)([^<]*) with (.*?) and group 8 will capture everything inside the <a> tag, nested tags included. You can then remove the nested tags with strip_tags or another regex.
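As a rough sketch of that suggestion, the snippet below reuses the pattern from the question with the three link-text groups swapped for a single (.*?) group. The added s modifier (so that . also matches newlines) and the variable names $inner and $label are assumptions made for illustration, not part of the original pattern.

```php
<?php
// Question's pattern, with ([^<|@]*)(@?)([^<]*) replaced by one lazy group (.*?).
// The "s" modifier is an addition so nested markup may span several lines.
$pattern = "/<a([ ]+)href=([\"']*)mailto:(([[:alnum:]._\-]+)@([[:alnum:]._\-]+\.[[:alnum:]._\-]+))([\"']*)([[:space:][:alnum:]=\"_]*)>(.*?)<\/a>/is";

$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

preg_match_all($pattern, $content, $matches);

// Group 8 now holds the raw inner HTML, nested tags included.
$inner = $matches[8][0];

// strip_tags() removes the nested tags, leaving only the visible text.
$text = strip_tags($inner);

// Keep the part before the "@" as the label.
$label = explode('@', $text, 2)[0];

echo $label, "\n"; // somename
```

Groups 4 and 5 (user and domain from the href) are unchanged, so the example usage from the question keeps working with $label in place of $matches[8][0].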

However, regular expressions are not well suited to nested HTML tags. You are better off using something like DOMDocument, which is made for parsing HTML. Something like:

<?php
$DOM = new DOMDocument();
// loadHTML() tolerates real-world markup; loadXML() only accepts well-formed XML
$DOM->loadHTML('<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>');

$list = $DOM->getElementsByTagName('a');

foreach ($list as $link) {
    $href = $link->getAttribute('href');
    // nodeValue is the link's text content with the nested tags removed
    $text = $link->nodeValue;
    // only match if href starts with mailto:
    if (stripos($href, 'mailto:') === 0) {
        var_dump($href);
        var_dump($text);
    }
}

http://codepad.viper-7.com/SqDKgr
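To tie the DOM approach back to the send.php link from the question, here is a minimal sketch that splits the href into user and domain and builds the replacement markup. The $links array, the use of rawurlencode and htmlspecialchars, and the surrounding <p> in the sample input are illustrative assumptions; it also assumes the PHP DOM extension is available.

```php
<?php
$content = '<p><a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a></p>';

$dom = new DOMDocument();
// Collect libxml warnings internally instead of printing them for imperfect HTML.
libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    // Skip anything that is not a mailto: link.
    if (stripos($href, 'mailto:') !== 0) {
        continue;
    }
    // Split "name@domain.com" into user and domain.
    $parts = explode('@', substr($href, strlen('mailto:')), 2);
    if (count($parts) !== 2) {
        continue;
    }
    list($user, $domain) = $parts;

    // textContent ignores nested tags; keep the part before "@" as the label.
    $label = ucwords(explode('@', trim($a->textContent), 2)[0]);

    $links[] = sprintf(
        '<a href="send.php?user=%s&amp;dom=%s">%s</a>',
        rawurlencode($user),
        rawurlencode($domain),
        htmlspecialchars($label)
    );
}

print_r($links);
// $links[0] is: <a href="send.php?user=name&amp;dom=domain.com">Somename</a>
```

Because the parser handles nesting, no extra capture groups are needed for the <span> and <b> tags.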

You can try this pattern:

$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
echo '<pre>' . print_r($matches, true) . '</pre>';

You can then access the data like this:

echo $matches[0]['name'];
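For illustration, the sketch below runs that pattern over two links and collects the named groups. The sample $html (including the second, single-quoted link) and the $rows array are made up for the example. Note that the content group holds the whole link text, so the label is taken as the part before the "@".

```php
<?php
$pattern = '~\bhref\s*+=\s*+(["\'])mailto:\K(?<mail>(?<name>[^@]++)@(?<domain>.*?))\1[^>]*+>(?:\s*+</?(?!a\b)[^>]*+>\s*+)*+(?<content>[^<]++)~i';

// Hypothetical input: one link with nested tags, one with single quotes and an extra attribute.
$html = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a> '
      . "<a href='mailto:info@example.org' class=\"x\">info@example.org</a>";

// PREG_SET_ORDER yields one array per link, indexed by group name and number.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$rows = [];
foreach ($matches as $m) {
    $rows[] = [
        'name'   => $m['name'],                        // user part of the href
        'domain' => $m['domain'],                      // domain part of the href
        'label'  => explode('@', $m['content'], 2)[0], // link text before the "@"
    ];
}

print_r($rows);
```

The first row comes out as name, domain.com, somename and the second as info, example.org, info.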

To get only the part inside the link, try:

[^>]*>([^>]+)@.*

What you need should be in the first group of the result.
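A quick way to check this is to run the expression against both sample links from the question. The / delimiters and the $results array are additions for the sketch.

```php
<?php
// The expression from the answer, wrapped in "/" delimiters for preg_match().
$pattern = '/[^>]*>([^>]+)@.*/';

$plain  = '<a href="mailto:name@domain.com">somename@domain.com</a>';
$nested = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

$results = [];
foreach ([$plain, $nested] as $content) {
    if (preg_match($pattern, $content, $m)) {
        // Group 1 is the text between the last opening tag and the "@".
        $results[] = $m[1];
    }
}

print_r($results); // both entries are "somename"
```

Keep in mind that the expression does not look for mailto: at all, so it matches any tag whose text contains an "@", and it finds nothing when the link text has no "@".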

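Try this regex:

/^(<.*>)(.*)(@)/

/^/ - Start of string

/(<.*>)/ - First match group: starts with <, then anything in between until it hits > (greedy, so it runs to the last > that still lets the rest match)

/(.*)(@)/ - Second and third match groups: anything up to the @ sign, then the @ itself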
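Applied to the nested example from the question, the three groups come out as sketched below. The variable names are illustrative, and because of the ^ anchor the sketch assumes the string starts with the <a> tag.

```php
<?php
$content = '<a href="mailto:name@domain.com"><span><b>somename@domain.com</b></span></a>';

$tags = $label = $at = null;
if (preg_match('/^(<.*>)(.*)(@)/', $content, $m)) {
    $tags  = $m[1]; // all opening tags up to the link text
    $label = $m[2]; // text before the "@"
    $at    = $m[3]; // the "@" itself
}

echo $tags, "\n";  // <a href="mailto:name@domain.com"><span><b>
echo $label, "\n"; // somename
```

The greedy (<.*>) first runs to the end of the string and then backtracks to the last > that is still followed by an "@", which is why the nested opening tags all end up in group 1.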