I want to determine whether a given string can be created by joining any of a set of substrings. As a specific example, I want to split a string "sgene"
according to what part of the regex sg|ge|ne|n|s
it matches. The answer is "s"
, "ge"
, "ne"
, because those three parts are how the string can be decomposed into parts from the regex, the desired set of substrings.
Go has regexp.(*Regexp).FindAllString
, and Ruby has Regexp.scan
to do this. In my code, one match is lost regardless of whether I order the substrings before or after the superstrings since my regexes overlap.
Here is a program to reproduce the problem in Go:
package main
import (
"fmt"
"regexp"
)
func main() {
str := "sgene"
superBeforeSub := regexp.MustCompile("sg|ge|ne|n|s")
subBeforeSuper := regexp.MustCompile("n|s|sg|ge|ne")
regexes := []*regexp.Regexp{superBeforeSub, subBeforeSuper}
for _, rgx := range regexes {
fmt.Println(rgx.MatchString(str), rgx.FindAllString(str, -1))
}
}
This program outputs:
true [sg ne]
true [s ge n]
And here is the same program in Ruby (problem for Ruby is also seen here):
str = "sgene"
regexes = [/sg|ge|ne|n|s/, /n|s|sg|ge|ne/]
regexes.each do |regex|
puts "%s %s" % [(regex === str).to_s, str.scan(regex).inspect]
end
It outputs:
true ["sg", "ne"]
true ["s", "ge", "n"]
The regex engines are aware that the string can be matched by the regex, but FindAllString
and scan
do not match it the way the boolean match does. They seem to use a greedy longest match search that ignores at least one e. How can I use regex to split the string into [s ge ne]
in either language?
This answer concerns Ruby only.
We are given the regex
r = /sg|ge|ne|n|s/
For this regex,
"sgene".scan r
#=> ["sg", "ne"]
As I understand, you want to find a rearrangement of the order of the elements of the regex, r_new
, such that
"sgene".scan(r_new).join == "sgene"
Viewed differently, but equivalently, you are given an array and a string
arr = ["sg", "ge", "ne", "n", "s"]
target = "sgene"
and want to determine if there is a permutation of some or all of the elements of arr
, perm
, such that
target == perm.join
and are asking if this can be done using a regex. I don't believe it can, but I cannot prove that. Moreover, several of the comments cast doubt on that.
It can be done, however, as follows.
(1..arr.size).each_with_object([]) { |n, perms|
arr.permutation(n).each { |p| perms << p if p.join==target } }
#=> [["s", "ge", "ne"]]
I used select
, rather than any?
, so that all permutations that work are identified. For example:
arr = ["sg", "ge", "ne", "n", "s", "e"]
(1..arr.size).each_with_object([]) { |n, perms|
arr.permutation(n).each { |p| perms << p if p.join==target } }
#=> [["sg", "e", "ne"], ["s", "ge", "ne"], ["s", "ge", "n", "e"]]
Maybe I'm not understanding what you are actually trying to do, but simply changing the order of the symbols in the regex works. This isn't a greedy vs non-greedy thing as you are using no wildcards.
It's simply a match in order of regex problem.
Here's a modified version on Play. The only different thing being to change the regex itself to give you the desired output.
All I did was move "n" to the end of the pattern, as in s|sg|ge|ne|n
package main
import (
"fmt"
"regexp"
)
func main() {
str := "sgene"
superBeforeSub := regexp.MustCompile("sg|ge|ne|n|s")
subBeforeSuper := regexp.MustCompile("n|s|sg|ge|ne")
orderIActuallyWant := regexp.MustCompile("s|sg|ge|ne|n")
regexes := []*regexp.Regexp{superBeforeSub, subBeforeSuper, orderIActuallyWant}
for _, rgx := range regexes {
fmt.Println(rgx.MatchString(str), rgx.FindAllString(str, -1))
}
}