I searched for an answer to this for a while but could not find one. There are many posts about matching text which is not preceded by certain text, but none seems to work for this case, where + is matched but is allowed only when preceded by a single + (e.g. ++).
I am trying to remove punctuation marks from text but let two consecutive + signs (++) stay, while single + signs disappear.
$text="Hello World! C+ C++ C#";
print_r(preg_replace('/(?!\+\+)[[:punct:]]/', ' ', $text));
Results in (I am not sure why the latter + is removed? can somebody explain?):
Hello World C C+ C
If I try:
$text="Hello World! C+ C++ C#";
print_r(preg_replace('/(?!\+)[[:punct:]]/', ' ', $text));
Result is:
Hello World C+ C++ C
But the result I want is:
Hello World C C++ C
Thanks
UPDATE: I realized that I should probably mention that there will be other characters which I want to avoid removing. I may have oversimplified the question. For example, I may want to keep # as well, so the result would be
Hello World C C++ C#
The solution should be easily expandable. I am sorry about the inconvenience caused by this missing information.
You have a couple of choices here, one being:
(?<!\+)[+#](?!\+)
# with lookarounds making sure no + is after/behind
In PHP:
<?php
$regex = '~(?<!\+)[+#](?!\+)~';
$string = 'Hello World! C+ C++ C#';
$string = preg_replace($regex, '', $string);
echo $string;
?>
Or use the (*SKIP)(*FAIL) mechanism (which is a bit faster in this example):
\+{2}(*SKIP)(*FAIL)|[+#]
# let two consecutive ++ always fail
See a demo for this one on regex101.com as well.
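To compare the two variants side by side, here is a small sketch that runs both patterns over the sample string from the question. The second input, 'a+++b', is a hypothetical addition (not from the question) chosen because the two patterns appear to treat runs of three pluses differently: the lookaround version leaves them alone, while the (*SKIP)(*FAIL) version skips the first pair and removes the leftover +.

```php
<?php
// Sample string from the question, plus a run of three pluses (hypothetical input).
$samples = ['Hello World! C+ C++ C#', 'a+++b'];

$lookaround = '~(?<!\+)[+#](?!\+)~';         // remove + or # with no + on either side
$skipFail   = '~\+{2}(*SKIP)(*FAIL)|[+#]~';  // skip every ++ pair, then remove + or #

$results = [];
foreach ($samples as $sample) {
    // Store [lookaround result, skip/fail result] per input.
    $results[$sample] = [
        preg_replace($lookaround, '', $sample),
        preg_replace($skipFail, '', $sample),
    ];
}

print_r($results);
```

For the question's input both patterns give the same string, so the choice only matters if longer runs of + can occur in your data.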
Last but not least: If you want to add characters/expressions that should be avoided as well, you can put them in a non-capturing group and let this one fail:
(?:\#|\+{2})(*SKIP)(*FAIL)|
[[:punct:]]
Yet another demo on the wonderful regex101.com site.
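Since the question asks for something easily expandable, the pattern above could be generated from a list of tokens to keep. The following is only a sketch: the function name stripPunctuationExcept and its parameters are made up for illustration, and it assumes the list is non-empty and contains literal strings rather than regex fragments.

```php
<?php
// Hypothetical helper (not part of the answer): builds the (*SKIP)(*FAIL)
// pattern from a list of literal tokens that must survive.
// Assumes $keep is non-empty.
function stripPunctuationExcept(string $text, array $keep): string
{
    // preg_quote() escapes each literal for use between "~" delimiters.
    $protected = implode('|', array_map(function ($token) {
        return preg_quote($token, '~');
    }, $keep));

    // Protected tokens are matched first and thrown away via (*SKIP)(*FAIL);
    // any other single punctuation character is what actually gets removed.
    $pattern = '~(?:' . $protected . ')(*SKIP)(*FAIL)|[[:punct:]]~';

    return preg_replace($pattern, '', $text);
}

$cleaned = stripPunctuationExcept('Hello World! C+ C++ C#', ['#', '++']);
echo $cleaned, PHP_EOL;
```

If one protected token is a prefix of another, listing the longer one first is probably the safer order, because alternation tries the branches from left to right.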
Your first regex (?!\+\+)[[:punct:]] doesn't work because, at each position, it checks in a negative lookahead that the next two characters are not ++, and then asserts that the immediately following character is a punctuation mark. When it reaches C++ with the cursor just after the first + sign, the lookahead succeeds since there is no + after the second +. So the second + is matched.
Hello World! C+ C+|+ C#
                  ^ Cursor here - (?!\+\+)[[:punct:]] is matched
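One way to see this for yourself is to list the byte offsets at which the pattern matches. This sketch uses preg_match_all with PREG_OFFSET_CAPTURE on the string from the question; in that string the two pluses of C++ sit at offsets 17 and 18.

```php
<?php
$text = 'Hello World! C+ C++ C#';

// PREG_OFFSET_CAPTURE returns [matched text, byte offset] pairs.
preg_match_all('/(?!\+\+)[[:punct:]]/', $text, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$char, $offset]) {
    echo $offset, ': ', $char, PHP_EOL;
}
// Offset 17 (first + of C++) is not listed; offset 18 (second +) is.
```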
Regex:
[[:punct:]]++(?(?<=\+)(?<=[^+]\+))
A possessive match combined with a conditional positive lookbehind assertion will do the job.
Explanation:
[[:punct:]]++ // Match punctuation marks possessively - won't allow backtracking
(?(?<=\+) // Start of a conditional statement, check if the last matched character is a `+`
(?<=[^+]\+) // If yes, it should not be preceded by another `+`
) // End of conditional
PHP:
preg_replace('@[[:punct:]]++(?(?<=\+)(?<=[^+]\+))@', ' ', $text)
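As a quick check, the call can be run on the string from the question. The pattern is written here with PCRE's conditional group syntax, (?(condition)yes-pattern), which is what the explanation describes; the variable names are only illustrative.

```php
<?php
$text = 'Hello World! C+ C++ C#';

// Possessive punctuation run, then: if the run ended in "+",
// require that this "+" is preceded by a non-"+" character.
$pattern = '@[[:punct:]]++(?(?<=\+)(?<=[^+]\+))@';

$result = preg_replace($pattern, ' ', $text);
var_dump($result);
```

With this input, !, the single + and # are each replaced by a space, while C++ stays intact.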
Update
If + signs are always preceded by some letters, there is a much shorter solution:
\b\+(?!\+)
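A short sketch of how this shorter pattern behaves, including the stated assumption. The input string is made up for illustration: it contains a + after a letter, a ++ pair, and a free-standing + with spaces around it, which \b does not reach.

```php
<?php
// Hypothetical input: "C+" and "C++" follow a letter, the "+" in "1 + 2" does not.
$text = 'C+ C++ 1 + 2';

// \b needs a word character directly before the +, (?!\+) rejects the start of ++.
$result = preg_replace('@\b\+(?!\+)@', ' ', $text);

var_dump($result);
```

Note that this pattern only deals with + signs; other punctuation would still need its own alternative.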
The first code snippet works like this: a punctuation symbol is found and, if it is not the starting point of a ++ sequence, it is matched and removed. So the second + in C++ matches and is removed.
Using the (*SKIP)(*FAIL) verbs, you may match what you want to keep and discard it from the match, and just match what you want to remove:
preg_replace('/\+{2}(*SKIP)(*F)|[[:punct:]]+/', ' ', $text);
Adding more characters - just in case:
preg_replace('/(?:[#^]|\*{3}|\+{2})(*SKIP)(*F)|[[:punct:]]+/', ' ', $text);
See the PHP demo
Details:
\+{2}(*SKIP)(*FAIL) - matches 2 + symbols and then discards them from the match
| - or
[[:punct:]]+ - matches one or more punctuation symbols.
In the replacement pattern, we just replace with a space.
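Here is a sketch of the extended pattern in action. Both input strings are invented for illustration. The second one shows an edge case that seems worth knowing: because [[:punct:]]+ is greedy, a protected token that directly follows other punctuation is consumed together with it.

```php
<?php
$pattern = '/(?:[#^]|\*{3}|\+{2})(*SKIP)(*F)|[[:punct:]]+/';

// Hypothetical input mixing protected tokens (^, ***, ++, #) with removable ones.
$kept = preg_replace($pattern, ' ', 'a^b *** c* C++ C+ C#!');
var_dump($kept);

// Edge case: "!" starts the punctuation run, so the following "++" goes with it.
$swallowed = preg_replace($pattern, ' ', 'wow!++');
var_dump($swallowed);
```

If that edge case matters, dropping the + quantifier after [[:punct:]] makes the pattern consume one character at a time, at the cost of one replacement space per removed character.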
I think there are three cases to handle when matching a plus sign.
The double plus has to be matched in order to move past it.
Note - This follows left-to-right rules about plus signs, with no rules other than these.
Find:
[^\P{P}+]|(\+\+)\+|\+
Replace: '$1 '
Explained
[^\P{P}+] # Punctuation but not plus
|
( \+\+ ) # (1), Plus with leading ++
\+
|
\+ # Any old plus sign
Which can be reduced to
[^\P{P}+] # Punctuation but not plus
|
( \+\+ )? # (1), Plus with optional leading ++
\+