I have a string of the form:
Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text
I want to extract Really Really Long Text... from the string with a regular expression. Up until the first [Closing].
If I do a regular expression like this:
$pMatch = "'\[Opening\](.+)\[Closing\]'si";
That gives me:
Really Really Long Text...[Closing]More Text
I can also make it not greedy like this:
$pMatch = "'\[Opening\](.+?)\[Closing\]'si";
Which works and gives me the correct output:
Really Really Long Text...
However, if I replace "Really Really Long Text..." with actual really really long text, it doesn't work and instead I receive a PREG_BACKTRACK_LIMIT_ERROR. I don't get an error if I use the greedy regular expression. I just get the wrong output as in the first case.
I've been working with regular expressions for a while, but this one has me stumped. Is there a way to get this to work with a regular expression or is regular expression not suitable for this task?
Here is PHP code to reproduce the issue:
<?php
$sShortString = "Some Text[Opening]Really Really Long Text...[Closing]More Text[Closing]Even More Text";
$sLongString = "Some Text[Opening]".str_repeat("BLAH", 1000000)."[Closing]More Text[Closing]Even More Text";
$pGreedyMatch = "'\[Opening\](.+)\[Closing\]'si";
$pNonGreedyMatch = "'\[Opening\](.+?)\[Closing\]'si";
header("Content-Type: text/plain");
if (preg_match($pGreedyMatch, $sShortString, $aMatch)) {
echo "Greedy Match:
";
print_r($aMatch);
}
if (preg_match($pNonGreedyMatch, $sShortString, $aMatch)) {
echo "Non-Greedy Match:
";
print_r($aMatch);
}
if (preg_match($pGreedyMatch, $sLongString, $aMatch)) {
echo "Greedy Match:
";
echo "Length: ".strlen($aMatch[1])."
";
}
if (preg_match($pNonGreedyMatch, $sLongString, $aMatch)) {
echo "Non-Greedy Match:
";
echo strlen($aMatch[1]);
} else {
echo "Non-Greedy Doesn't Match!
";
}
$iLastError = preg_last_error();
if ($iLastError == PREG_BACKTRACK_LIMIT_ERROR) {
echo "It's because the backtrack limit was exceeded!
";
}
?>
I get the output:
Greedy Match:
Array
(
[0] => [Opening]Really Really Long Text...[Closing]More Text[Closing]
[1] => Really Really Long Text...[Closing]More Text
)
Non-Greedy Match:
Array
(
[0] => [Opening]Really Really Long Text...[Closing]
[1] => Really Really Long Text...
)
Greedy Match:
Length: 4000018
Non-Greedy Doesn't Match!
It's because the backtrack limit was exceeded!
I've got it working by using the greedy regular expression and using additional code to strip off the text from [Closing] onward. I would like to better understand what's happening behind the scenes, why it needs to do so much backtracking, and if there's a way that the regular expression can be modified so it performs the task.
I really appreciate any insight!
A non-greedy quantifier has a cost because each time it reads a character, it has to check against the end of the pattern.
In the above pattern, each time the .
in (.+?)
matches, it does a check to see if the following characters match [Closing]
. Each time this happens, and it doesn't match, it has to backtrack and continue the search. This is why the backtrack limit it used up.
The pattern can be rewritten like this:
'\[Opening\]([^\[]*(?:\[(?!Closing)[^\[]*)*)(*SKIP)\[Closing\]'si
Let's examine this pattern piece by piece to understand it.
1) We open with \[Opening\]
. This pattern matches the opening tag.
2) As our pattern isn't repeating within itself, the ()(*SKIP)
directive is used as a further optimization. It means that if we don't match the pattern then we will restart our search from the end of where we were looking. The default behaviour would start to search again at the next character.
To better understand this, imagine that our string is sometimes we get [Close to matching
. When we get to the [
, we scan [Clos
before we conclude that this actually isn't the pattern we want. Normally, we backtrack and then start again looking at Close
. However, (*SKIP)
allows us to continue searching at e to matc
.
3) Inside our brackets we start with the pattern [^\[]*
, which allows us to match as many characters as we can which are not [
. ^
indicates not, \[
is for the [
, and []
surrounds it as a character set. *
allows it to repeat as many times as possible.
4) Now, we have (?:)*
. ()
allows us to specify a string, and ?:
indicates that is not going to be saved, and *
allows it to repeat as many times as we like (including no times at all).
5) The first character in that string is \[
or just the [
we expect as part of our closing tag.
6) Next, we have (?!Closing\])
. (?!)
is a negative lookahead. A lookahead means that the parser will look at the next characters and either match or fail to match without consuming the characters. This allows us to match something as long as it's not Closing]
without actually consuming it.
7) We have another [^\[]*
which allows us to continue to eat characters after our failure to lookahead. This allows us to continue to consume the string after we get something like [Clos
.
8) Finally, our regular expression ends with \[Closing\]
.