Consider the following toy example. I want to match in Go a name with a regexp where the name is sequences of letters a
separated by single #
, so a#a#aaa
is valid, but a#
or a##a
are not. I can code the regexp in the following two ways:
r1 := regexp.MustCompile(`^a+(#a+)*$`)
r2 := regexp.MustCompile(`^(a+#)*a+$`)
Both of these work. Now consider more complex task of matching a sequence of names separated by single slash. As in above, I can code that in two ways:
^N+(/N+)*$
^(N+/)*N+$
where N is a regexp for the name with ^ and $ stripped. As I have two cases for N, so now I can have 4 regexps:
^a+(#a+)*(/a+(#a+)*)*$
^(a+#)*a+(/a+(#a+)*)*$
^((a+#)*a+/)*a+(#a+)*$
^((a+#)*a+/)*(a+#)*a+$
The question is why when matching against the string "aa#a#a/a#a/a"
the first one fails while the rest 3 cases work as expected? I.e. what causes the first regexp to mismatch? The full code is:
package main
import (
"fmt"
"regexp"
)
func main() {
str := "aa#a#a/a#a/a"
regs := []string {
`^a+(#a+)*(/a+(#a+)*)*$`,
`^(a+#)*a+(/a+(#a+)*)*$`,
`^((a+#)*a+/)*a+(#a+)*$`,
`^((a+#)*a+/)*(a+#)*a+$`,
}
for _, r := range(regs) {
fmt.Println(regexp.MustCompile(r).MatchString(str))
}
}
Surprisingly it prints false true true true
You can try to alleviate atomic sub-nesting of quantifiers.
If you didn't have the anchors, the expression would possibly blow up
in backtracking if using nested optional quantifiers like this,
when it can't find a direct solution.
Go might be balking at that, forcing it to fail outright instead
of massive backtracking, but not sure.
Try something like this, see if it works.
# ^a+(#?a)*(/a+(#?a)*)*$
^
a+
( # (1 start)
\#?
a
)* # (1 end)
( # (2 start)
/
a+
( # (3 start)
\#?
a
)* # (3 end)
)* # (2 end)
$
edit: (Transposed from comment) If complexity is too high, some engines will not even compile it, some will silently fail at runtime, some will lock up in a backtracking loop.
This little subtle difference is the problem
bad: too complex (#?a+)*
good: no nested, open ended quantifiers (#?a)*
If you ever have this problem, strip out the nesting, usually solves
the backtracking issue.
eit2: If you require a separator and want to insure a separator is only in the middle and surrounded by a
's, you could try this
https://play.golang.org/p/oM6B6H3Kdx
# ^a+[#/](a[#/]?)*a+$
^
a+
[\#/]
( # (1 start)
a
[\#/]?
)* # (1 end)
a+
$
or this
https://play.golang.org/p/WihqSjH_dI
# ^a+(?:[#/]a+)+$
^
a+
(?:
[\#/]
a+
)+
$
This must be a bug in golang regexp engine. A simpler test case is that ^a(/a+(#a+)*)*$
fails to match a/a#a
while ^(a)(/a+(#a+)*)*$
works, see http://play.golang.org/p/CDKxVeXW98 .