We have a piece of regex that adds a <strong> tag around keywords, provided they are not already inside one of a fixed set of tags. This has always worked nicely:
foreach ($keywords as $keyword) {
    $str = preg_replace("/(?!(?:[^<]+>|[^>]+(<\/strong>|<\/a>|<\/b>|<\/i>|<\/u>|<\/em>)))\b(" . preg_quote($keyword, "/") . ")\b/is", "<strong>\\2</strong>", $str, 1);
}
So if the keyword was test, this would change:
A test line
to:
A <strong>test</strong> line
but this would not change:
<a href="">A test line</a>
As you can see, the list of closing tags we want it to ignore is in the regex.
We have encountered a problem with a string that looks like:
<a href="">A test <em>line</em></a>
It's not recognising the closing </a> (or the </em>, for that matter), so it's coming out as:
<a href="">A <strong>test</strong> <em>line</em></a>
That is not what we want. Can anyone see a fix for this? (And yes, I am aware of the "don't parse HTML with regex" rule, so posting links to that infamous post is not an answer ;-))
The following regex tries to match the keyword test only when it is not enclosed by any of the a, b, i, u, em or strong tags.
Regex
/^.*?(?!<(a|b|i|u|em|strong).*?>.*?)\btest\b(?!.*?<\/\1>)/i
Test
A test line => MATCH
<a href="">A test line</a> => NO MATCH
<a href="">A test <em>line</em></a> => NO MATCH
Discussion
^.*?(?!<(a|b|i|u|em|strong).*?>.*?) => The keyword `test' must not be preceded by any of the listed tags followed by any characters
\btest\b => The keyword we want to match
(?!.*?<\/\1>) => The keyword `test' must not be followed by the closing tag of the tag opened previously
Tip
You can extend the regex to multiple keywords (kw1, kw2 and kw3 in the example below) like this:
/^.*?(?!<(a|b|i|u|em|strong).*?>.*?)\b(?:kw1|kw2|kw3)\b(?!.*?<\/\1>)/i
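If the keywords come from an array, as $keywords does in the question, the alternation can be built rather than typed by hand. The following is a minimal sketch; the function name keyword_alternation is hypothetical, and it only produces the \b(?:...)\b fragment, not the surrounding lookaheads.

```php
<?php
// Hypothetical helper: turns a keyword list into a word-bounded alternation.
// Each keyword is escaped with preg_quote() so regex metacharacters are literal.
function keyword_alternation(array $keywords): string
{
    $quoted = array_map(
        function ($kw) {
            return preg_quote($kw, '/'); // '/' is the pattern delimiter used above
        },
        $keywords
    );

    return '\b(?:' . implode('|', $quoted) . ')\b';
}

$alt = keyword_alternation(['kw1', 'kw2', 'a.b']);
```

The resulting fragment can then be concatenated into the pattern in place of the hand-written (?:kw1|kw2|kw3) group.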
Warning
This regex works on the test cases provided above, but not in all cases.
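Given that warning, one alternative is to stop relying on a single lookahead and instead split the string into tags and text, keeping a count of how many ignored tags are currently open. This is only a sketch, not the original code or a full HTML parser: the function name highlight_keywords and its $protected parameter are made up for illustration, and it assumes reasonably well-formed markup with no > characters inside attribute values.

```php
<?php
// Sketch: wrap the first occurrence of each keyword in <strong>, skipping any
// text that sits inside one of the protected tags (at any nesting depth).
function highlight_keywords(
    string $str,
    array $keywords,
    array $protected = ['a', 'b', 'i', 'u', 'em', 'strong']
): string {
    foreach ($keywords as $keyword) {
        $pattern = '/\b(' . preg_quote($keyword, '/') . ')\b/i';

        // Even indexes hold text, odd indexes hold the captured tags.
        $parts = preg_split('/(<[^>]*>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
        $depth = 0; // number of protected tags currently open

        foreach ($parts as $i => $part) {
            if ($i % 2 === 1) {
                // A tag: adjust the depth if it opens or closes a protected tag.
                if (preg_match('/^<(\/?)([a-z][a-z0-9]*)/i', $part, $m)
                    && in_array(strtolower($m[2]), $protected, true)) {
                    $depth = max(0, $depth + ($m[1] === '/' ? -1 : 1));
                }
                continue;
            }

            if ($depth === 0) {
                // Plain text outside every protected tag: replace at most once.
                $parts[$i] = preg_replace($pattern, '<strong>$1</strong>', $part, 1, $count);
                if ($count > 0) {
                    break; // keep the "first occurrence only" behaviour
                }
            }
        }

        $str = implode('', $parts);
    }

    return $str;
}
```

Called as highlight_keywords($str, $keywords), it is intended as a drop-in for the foreach loop in the question. Because strong is in the protected list, a keyword wrapped on one pass is not wrapped again by a later keyword.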