PHP PCRE: match punctuation but not ++

I searched for an answer to this for a while but could not find one. There are many posts about matching text that is not preceded by certain text, but none seems to work for this case, where a + should be matched unless it is part of two consecutive + signs (e.g. ++).

I am trying to remove punctuation marks from text, keeping two consecutive + signs (++) but removing single + signs.

$text="Hello World! C+ C++ C#";
print_r(preg_replace('/(?!\+\+)[[:punct:]]/', ' ', $text));

Results in (I am not sure why the second + of ++ is removed; can somebody explain?):

Hello World C C+ C

If I try:

$text="Hello World! C+ C++ C#";
print_r(preg_replace('/(?!\+)[[:punct:]]/', ' ', $text));

Result is:

Hello World C+ C++ C

But the result I want is:

Hello World C C++ C

Thanks

UPDATE: I realized I should mention that there will be other characters that I want to keep as well; I may have oversimplified the question. For example, I may also want to keep #, so the result would be

Hello World C C++ C#

The solution should be easily extensible. I am sorry for the inconvenience caused by this missing information.

You have a couple of choices here, one being:

(?<!\+)[+#](?!\+)
# the lookarounds make sure no + comes directly before or after

See a demo on regex101.com.


In PHP:
<?php

$regex = '~(?<!\+)[+#](?!\+)~';

$string = 'Hello World! C+ C++ C#';
$string = preg_replace($regex, '', $string);

echo $string;
?>


Another one would be to use the (*SKIP)(*FAIL) mechanism (which is a bit faster in this example):
\+{2}(*SKIP)(*FAIL)|[+#]
# let two consecutive ++ always fail

See a demo for this one on regex101.com as well.
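
For comparison, here is a minimal PHP sketch that runs the (*SKIP)(*FAIL) pattern next to the lookaround pattern from above. The variable names and the extra C+++ input are illustrative assumptions; only the two patterns come from this answer.

```php
<?php
// The two patterns from this answer.
$lookaround = '~(?<!\+)[+#](?!\+)~';
$skipFail   = '~\+{2}(*SKIP)(*FAIL)|[+#]~';

$string = 'Hello World! C+ C++ C#';

// On the sample input both remove the lone + and the #, and keep ++.
$viaLookaround = preg_replace($lookaround, '', $string);
$viaSkipFail   = preg_replace($skipFail, '', $string);
echo $viaLookaround, PHP_EOL;
echo $viaSkipFail, PHP_EOL;

// Illustrative extra input: the lookarounds keep every + that touches
// another +, while (*SKIP)(*FAIL) protects plus signs two at a time.
$tripleLookaround = preg_replace($lookaround, '', 'C+++');
$tripleSkipFail   = preg_replace($skipFail, '', 'C+++');
echo $tripleLookaround, PHP_EOL;
echo $tripleSkipFail, PHP_EOL;
```

So the two approaches agree on the question's input but treat runs of three or more + signs differently, which may matter when choosing between them.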

Last but not least: if you want to add characters or expressions that should be kept as well, you can put them in a non-capturing group and let that group fail (the pattern is written in free-spacing style, so it needs the x flag):

(?:\#|\+{2})(*SKIP)(*FAIL)|
[[:punct:]]

Yet another demo on the wonderful regex101.com site.
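
To make the list of protected tokens easy to extend, the pattern can be assembled from an array. The following is only a sketch: strip_punct_except() is a hypothetical helper name, not a PHP built-in, and it assumes $keep is non-empty. It replaces with a space, as in the question.

```php
<?php
// Hypothetical helper (not part of PHP): replaces every punctuation character
// with a space, except where one of the $keep tokens matches.
// Assumes $keep contains at least one token.
function strip_punct_except(string $text, array $keep): string
{
    // Longer tokens first, so '++' would win over a shorter token such as '+'.
    usort($keep, function ($a, $b) {
        return strlen($b) - strlen($a);
    });

    // Quote each token so characters like + and # are taken literally.
    $quoted = array_map(function ($token) {
        return preg_quote($token, '~');
    }, $keep);

    // Protected tokens are matched first and then discarded via (*SKIP)(*FAIL).
    $pattern = '~(?:' . implode('|', $quoted) . ')(*SKIP)(*FAIL)|[[:punct:]]~';

    return (string) preg_replace($pattern, ' ', $text);
}

$result = strip_punct_except('Hello World! C+ C++ C#', ['++', '#']);
var_dump($result);
```

Adding another protected token then only means adding another array element, for example ['++', '#', '***'].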

Your first regex (?!\+\+)[[:punct:]] doesn't work because the negative lookahead is evaluated at each position: it only asserts that the next two characters are not ++, and then the next character has to be a punctuation mark. In C++, when the cursor sits between the two + signs, the lookahead succeeds because the second + is not followed by another +. So the second + is matched and removed.

Hello World! C+ C+|+ C#
                  ^ Cursor here - (?!\+\+)[[:punct:]] is matched
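
This can be checked by listing the offsets at which the original pattern matches. A small sketch, assuming the sample string from the question (the variable names are only illustrative):

```php
<?php
$text = 'Hello World! C+ C++ C#';

// Collect every match of the original pattern together with its byte offset.
preg_match_all('/(?!\+\+)[[:punct:]]/', $text, $matches, PREG_OFFSET_CAPTURE);

$offsets = [];
foreach ($matches[0] as $match) {
    // $match[0] is the matched character, $match[1] its offset in $text.
    $offsets[] = $match[1];
    echo $match[1], ': ', $match[0], PHP_EOL;
}
```

Offset 17 (the first + of C++) is missing from the list because ++ follows it, while offset 18 (the second +) is matched because only a space follows it.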

Regex:

[[:punct:]]++(?(?<=\+)(?<=[^+]\+))

A possessive match combined with a conditional lookbehind assertion will do the job.

Live demo

Explanation:

[[:punct:]]++   // Match punctuation marks possessively - won't allow backtracking
(?(?<=\+)       // Start of a conditional: check whether the last matched character is a `+`
    (?<=[^+]\+) // If yes, that `+` must not be preceded by another `+`
)               // End of conditional

PHP:

preg_replace('@[[:punct:]]++(?(?<=\+)(?<=[^+]\+))@', ' ', $text)
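
A runnable sketch of this approach, written with PCRE's conditional group syntax (?(condition)yes-pattern). The second input is an illustrative assumption added to show one limitation of matching whole punctuation runs.

```php
<?php
$text = 'Hello World! C+ C++ C#';

// (?(?<=\+) ... ) is a conditional group: the inner lookbehind is only
// required when the possessive run ended in a plus sign.
$pattern = '@[[:punct:]]++(?(?<=\+)(?<=[^+]\+))@';

$result = preg_replace($pattern, ' ', $text);
var_dump($result);

// Illustrative extra input: a run that contains ++ but ends in another
// punctuation mark is matched as a whole, so the ++ is lost here.
$edge = preg_replace($pattern, ' ', 'C++, Java');
var_dump($edge);
```

If ++ can be followed directly by other punctuation in your data, one of the (*SKIP)(*FAIL) patterns is probably the safer choice.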

Update

If + signs are always preceded by a word character (a letter, digit or underscore), there is a much shorter solution for the + signs themselves:

\b\+(?!\+)
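
A quick sketch of the short pattern in PHP, assuming the sample string from the question. Note that it only touches plus signs, so other punctuation would need a separate step.

```php
<?php
$text = 'Hello World! C+ C++ C#';

// \b only matches here when the + directly follows a letter, digit or
// underscore; (?!\+) rejects a + that starts a ++ pair.
$result = preg_replace('/\b\+(?!\+)/', '', $text);
var_dump($result);
```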

The first code snippet works like this: a punctuation symbol is found and, if it is not the starting point of a ++ sequence, it is matched and removed. So the second + in C++ matches and is removed.

You may use the (*SKIP)(*FAIL) verbs to match what you want to keep and discard it from the match, and then simply match what you want to remove:

preg_replace('/\+{2}(*SKIP)(*F)|[[:punct:]]+/', ' ', $text);

Adding more characters - just in case:

preg_replace('/(?:[#^]|\*{3}|\+{2})(*SKIP)(*F)|[[:punct:]]+/', ' ', $text);
               ^^^                ^

See the PHP demo

Details:

  • \+{2}(*SKIP)(*FAIL) - matches two + symbols and then discards them from the match, so they stay untouched ((*F) in the patterns above is shorthand for (*FAIL))
  • | - or
  • [[:punct:]]+ - matches one or more punctuation symbols.

In the replacement pattern, we just replace with a space.
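
Because every removed run becomes a space, the output contains doubled and trailing spaces. The sketch below adds a whitespace clean-up step; that second step is an assumption on my part and not part of the patterns above.

```php
<?php
$text = 'Hello World! C+ C++ C#';

// Step 1: replace punctuation runs with a space, skipping over ++.
$spaced = preg_replace('/\+{2}(*SKIP)(*F)|[[:punct:]]+/', ' ', $text);

// Step 2 (an addition): collapse repeated whitespace and trim the ends.
$clean = trim(preg_replace('/\s+/', ' ', $spaced));

// Same two steps with the extended pattern that also protects #, ^ and ***.
$spacedExt = preg_replace('/(?:[#^]|\*{3}|\+{2})(*SKIP)(*F)|[[:punct:]]+/', ' ', $text);
$cleanExt  = trim(preg_replace('/\s+/', ' ', $spacedExt));

var_dump($spaced, $clean, $cleanExt);
```

After the clean-up, $clean and $cleanExt have the shape of the two desired results given in the question.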

I think there are three cases to handle here: punctuation other than a plus, a double plus, and a lone plus.
The double plus has to be matched (and written back) so that the regex can move past it.

Note - Plus signs are consumed left to right, with no rules other than these, so +++ is treated as ++ followed by a lone +.

Find:

[^\P{P}+]|(\+\+)|\+

Replace: '$1 '

Explained

    [^\P{P}+]           # Punctuation but not plus
 |  
    ( \+\+ )            # (1), Double plus, captured and written back by $1
 |  
    \+                  # Any old plus sign

Group 1 is empty for the first and third alternatives, so the replacement yields a single space for removed characters and '++ ' for a double plus that is kept.
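
A minimal PHP sketch of the match-and-write-back idea, assuming a double plus is captured in group 1 with nothing further required after it. The variable names are illustrative.

```php
<?php
$text = 'Hello World! C+ C++ C#';

// Group 1 captures a double plus so that '$1 ' can write it back;
// for the other two alternatives group 1 is unset and '$1' expands to nothing.
$pattern = '/[^\P{P}+]|(\+\+)|\+/';

$result = preg_replace($pattern, '$1 ', $text);
var_dump($result);
```

One thing to keep in mind: \p{P} is the Unicode punctuation category, which is narrower than [[:punct:]]. Characters such as $ or ^ are classified as symbols rather than punctuation, so this pattern leaves them in place.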