Regex with negative lookbehind in PHP

I'm doing some SEO on a huge catalog of product descriptions using preg_replace_callback and am having some difficulty with the regex.

I'd like to replace every occurrence of certain words (hat, shirt) except those that follow "men's" with 0-2 words in between, e.g. the words in "men's pretty black hat" and "men's long shirt" shouldn't be replaced.

Here is some debug code; in the real application I use a callback to pick the proper replacement for each word:

$str = "men's black hat, and orange shirt!";
preg_match_all('/((\s|\.\s|,\s|\!\s|\?\s)(hat|shirt)(\s|\.|\.\s|,\s|\!|\!\s|\?|\?\s))/i', $str, $_matches);
print_r($_matches);

Thanks

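In PCRE (the engine behind PHP's preg_* functions) a lookbehind cannot contain unbounded quantifiers such as \s+ or \w+, so a lookbehind for "men's" plus 0-2 arbitrary words won't compile and this way of attacking the problem won't work.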
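To see the limitation for yourself, here is a minimal sketch. The pattern is only my guess at what the "ideal" lookbehind would look like for this problem; the point is just that preg_match refuses to compile it and returns false.

```php
<?php
// Assumed test pattern: a variable-length negative lookbehind for "men's" + 0-2 words.
$pattern = '/(?<!\bmen\'s\s+(\w+\s+){0,2})\b(hat|shirt)\b/i';

// @ hides the E_WARNING about the failed compilation so only the return value is shown.
$result = @preg_match($pattern, "men's black hat");

var_dump($result); // bool(false): the pattern did not compile, so nothing was matched
```

Note that false here means "error", not "no match" (which would be 0).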

IMHO you are trying to make preg_replace_callback do way too much. If you want to perform manipulation that is complex beyond a certain level, it's reasonable to forfeit the convenience of a single function call. Here's another way you can attack the problem:

  1. Use preg_split to split the text into words along with the flag PREG_SPLIT_OFFSET_CAPTURE so that you know where each word appears in the original text.
  2. Iterate over the array of words. It's now very easy to do a "negative lookbehind" on the array and see if a hat or shirt is preceded by any one of the other terms that interest you.
  3. Whenever you find a positive match for hat or shirt, use the offset from preg_split and the (known) length of the positive match to power substr_replace on the original text input.

For example:

$str = "men's black hat, and orange shirt!";
$targets = array('hat', 'shirt');
$shield = 'men\'s';
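$bias = 0;

// Split into words, keeping apostrophes inside a word and recording each word's byte offset.
$words = preg_split('/[^\w\']+/', $str, -1, PREG_SPLIT_OFFSET_CAPTURE | PREG_SPLIT_NO_EMPTY);
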
for ($i = 0; $i < count($words); ++$i) {
    list ($word, $offset) = $words[$i];

    if (!in_array($word, $targets)) {
        continue;
    }

    // Look back over the previous three words: the shield itself plus up to two words in between.
    for ($j = max($i - 3, 0); $j < $i; ++$j) {
        if ($words[$j][0] === $shield) {
            continue 2;
        }
    }

    $replacement = 'FOO';
    $str = substr_replace($str, $replacement, $offset + $bias, strlen($word));
    $bias += strlen($replacement) - strlen($word);
}

echo $str;

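Since the real application picks a different replacement per word, the same loop can be wrapped in a function that takes a callback. This is only a sketch: the function name, the $maxGap parameter and the lowercase comparison are my own assumptions, not something from the question.

```php
<?php
// Hypothetical helper: replaces each target word unless $shield appears
// within the $maxGap + 1 words before it. $targets and $shield are lowercase.
function replaceUnshieldedWords(string $str, array $targets, string $shield, callable $pick, int $maxGap = 2): string
{
    // Words with their byte offsets; apostrophes stay inside a word.
    $words = preg_split('/[^\w\']+/', $str, -1, PREG_SPLIT_OFFSET_CAPTURE | PREG_SPLIT_NO_EMPTY);
    $bias = 0; // how far earlier replacements have shifted the later offsets

    foreach ($words as $i => [$word, $offset]) {
        if (!in_array(strtolower($word), $targets, true)) {
            continue;
        }

        // The array-based "negative lookbehind".
        for ($j = max($i - $maxGap - 1, 0); $j < $i; ++$j) {
            if (strtolower($words[$j][0]) === $shield) {
                continue 2; // shielded: skip to the next word
            }
        }

        $replacement = $pick($word);
        $str = substr_replace($str, $replacement, $offset + $bias, strlen($word));
        $bias += strlen($replacement) - strlen($word);
    }

    return $str;
}

echo replaceUnshieldedWords("men's black hat, and orange shirt!", ['hat', 'shirt'], "men's", 'strtoupper');
// men's black hat, and orange SHIRT!
```

Any callable works for $pick, so the existing preg_replace_callback logic could be reused with only the argument changed from a match array to a single word.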

As far as I know, PCRE (and therefore PHP) doesn't support variable-length negative lookbehinds.

A trick is to reverse the string and use negative lookaheads. So, where you'd "ideally" want to do:

preg_match_all('/(?<!\bmen\'s\s+(\w+\s+){0,2})\b(hat|shirt)\b/i', $str, $_matches);

you could do

preg_match_all('/\b(tah|trihs)\b(?!(\s+\w+){0,2}\s+s\'nem\b)/i', strrev($str), $rev_matches);

and then use array_map with strrev to reverse each match back.
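If you need the replacement rather than just the matches, the same trick works with preg_replace_callback. A rough sketch, assuming the text is single-byte (strrev reverses bytes, so multibyte UTF-8 would get mangled); the helper name is made up for illustration:

```php
<?php
// Hypothetical helper; assumes single-byte (ASCII) text because strrev() works on bytes.
function replaceViaReversal(string $str, callable $pick): string
{
    // Reversed pattern: "tah"/"trihs" NOT followed by 0-2 words and then "s'nem".
    $pattern = '/\b(tah|trihs)\b(?!(\s+\w+){0,2}\s+s\'nem\b)/i';

    $reversed = preg_replace_callback(
        $pattern,
        function (array $m) use ($pick): string {
            // Un-reverse the match, pick its replacement, then reverse that
            // so it reads correctly once the whole string is flipped back.
            return strrev($pick(strrev($m[1])));
        },
        strrev($str)
    );

    return strrev($reversed);
}

echo replaceViaReversal("men's black hat, and orange shirt!", 'strtoupper');
// men's black hat, and orange SHIRT!
```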

By the way, \b is a word boundary assertion. It's probably what you mean to use instead of all the (\s|\.|\.\s|,\s|\!|\!\s|\?|\?\s) alternatives.
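For instance, something along these lines finds the whole words without listing every possible punctuation neighbour (the extra sentence in the test string is mine, added to show that "that" and "hatch" are left alone). It doesn't handle the "men's" exclusion, only the boundaries:

```php
<?php
$str = "men's black hat, and orange shirt! Not that hatch.";

// \b matches between a word character and a non-word character (or the string edge),
// so it covers spaces, commas, periods, "!" and "?" without consuming them.
preg_match_all('/\b(hat|shirt)\b/i', $str, $matches);

print_r($matches[1]); // contains "hat" and "shirt" only
```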