preg_match_all up to the next comment tag (HTML included)

I am trying to capture the text between the brackets of each comment tag, plus all the text up to the next occurrence of a comment tag. At the moment I only get the text between the brackets; the content up to the next comment comes back as an empty string "". I'm kind of confused. Thanks!

header("Content-Type:text/plain");
$tmp= file_get_contents("filter.html");
preg_match_all('@<!--\[(.*?)\]-->(.*?)@su', $tmp, $found, PREG_SET_ORDER);
var_dump($found);

filter.html

<!--[%TEST%]-->
TEST
TEST
<!--[%DAS%]-->
DAS TEST
123456
<!--[%BKK%]-->
ABCDEFG
YXZ

The output I get is:

array(3) {
  [0]=>
  array(3) {
    [0]=>
    string(15) "<!--[%TEST%]-->"
    [1]=>
    string(6) "%TEST%"
    [2]=>
    string(0) ""
  }
  [1]=>
  array(3) {
    [0]=>
    string(14) "<!--[%DAS%]-->"
    [1]=>
    string(5) "%DAS%"
    [2]=>
    string(0) ""
  }
  [2]=>
  array(3) {
    [0]=>
    string(14) "<!--[%BKK%]-->"
    [1]=>
    string(5) "%BKK%"
    [2]=>
    string(0) ""
  }
}

Solution: change the regex into...

@<!--\[(.*?)\]-->(.*?)(?=<!--|$)@su


Explanation: the original regex almost correctly used the .*? expression to capture the non-comment part. I say 'correctly' because the lazy quantifier is indeed required here (otherwise .* would happily consume the rest of the string, since the s flag lets . match newlines too). And I say 'almost' because the quantifier is too lazy in this particular case: nothing follows it in the pattern, so even an empty string is enough to satisfy it (as '' does match /.*/). That's why you get those empty strings in $found - the victims of laziness taken to the extreme, they were...
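To see the two extremes side by side, here is a minimal sketch that runs the original pattern and a greedy variant against a one-tag string. The variable names $subject, $lazy and $greedy are made up for this illustration and are not part of the question's code.

```php
<?php
// Hypothetical one-tag input, just enough to show the behaviour.
$subject = "<!--[%TEST%]-->\nTEST\n";

// Original pattern: the trailing lazy group has nothing after it
// that would force it to grow, so it stops at zero characters.
preg_match('@<!--\[(.*?)\]-->(.*?)@su', $subject, $lazy);

// Greedy variant: with the s flag, .* swallows everything up to the
// end of the string, newlines and any later comment tags included.
preg_match('@<!--\[(.*?)\]-->(.*)@su', $subject, $greedy);

var_dump($lazy[2]);   // empty string
var_dump($greedy[2]); // "\nTEST\n"
```

Neither extreme is what the question wants: the trailing group needs something after it in the pattern that tells it where to stop.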

So what we need is to make this part of the regex a bit more 'eager' - persuade it to keep devouring the string until it...

  • either encounters the beginning of the new comment ('<!--'),
  • or arrives at the end of the string.

And that's exactly expressed by this lookahead pattern:

(?=<!--|$)

It reads as 'match ONLY at a position that is either followed by a new comment or is the end of the string'. And that's how it whips the lazy .*? sub-expression into shape - it is no longer able to stop wherever it alone wants to.
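As a rough end-to-end check, the sketch below applies the lookahead version of the pattern to the sample data from the question. It assumes the file contents are inlined as a nowdoc string (so it runs without filter.html), and the $sections array with its trim() call is an illustrative addition, not part of the original code.

```php
<?php
// Inline copy of filter.html from the question (assumption: same content).
$tmp = <<<'HTML'
<!--[%TEST%]-->
TEST
TEST
<!--[%DAS%]-->
DAS TEST
123456
<!--[%BKK%]-->
ABCDEFG
YXZ
HTML;

// The lazy group now has to keep expanding until the lookahead succeeds,
// i.e. until the next '<!--' or the end of the string.
preg_match_all('@<!--\[(.*?)\]-->(.*?)(?=<!--|$)@su', $tmp, $found, PREG_SET_ORDER);

// Hypothetical post-processing: map each tag text to its content.
$sections = [];
foreach ($found as $match) {
    // $match[1] is the text between the brackets,
    // $match[2] is everything up to the next comment (or the end).
    $sections[$match[1]] = trim($match[2]);
}

print_r($sections);
```

Each captured content block starts and ends with the newlines that surround it in the file, which is why the sketch trims it; drop the trim() if the raw whitespace matters.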