I have a small but tricky issue with Go regular expressions. It seems the \b
word-boundary assertion doesn't work when the input contains accented Latin characters.
I expected é
to be treated as a regular word character, but it is treated as a word boundary.
package main

import (
	"fmt"
	"regexp"
)

func main() {
	r, _ := regexp.Compile(`\b(vis)\b`)
	fmt.Println(r.MatchString("re vis e"))
	fmt.Println(r.MatchString("revise"))
	fmt.Println(r.MatchString("révisé"))
}
The result was:
true
false
true
How can I make r.MatchString("révisé")
return false?
Thank you.
The issue is that \b
matches only at ASCII word boundaries, as stated in the docs:
at ASCII word boundary (\w on one side and \W, \A, or \z on the other)
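The quoted rule can be checked directly. The following is a minimal sketch, not part of the original answer; the helper names isASCIIWordChar and isUnicodeLetter are made up for illustration. It shows that \w rejects é while the Unicode class \p{L} accepts it:

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWord uses Go's \w, which is ASCII-only: [0-9A-Za-z_].
var asciiWord = regexp.MustCompile(`^\w$`)

// unicodeLetter uses \p{L}, which matches any Unicode letter.
var unicodeLetter = regexp.MustCompile(`^\p{L}$`)

// isASCIIWordChar reports whether s is exactly one character matched by \w.
// (Hypothetical helper, named here for illustration only.)
func isASCIIWordChar(s string) bool { return asciiWord.MatchString(s) }

// isUnicodeLetter reports whether s is exactly one Unicode letter.
// (Hypothetical helper, named here for illustration only.)
func isUnicodeLetter(s string) bool { return unicodeLetter.MatchString(s) }

func main() {
	// "\u00e9" is the precomposed character é.
	fmt.Println(isASCIIWordChar("e"))      // true
	fmt.Println(isASCIIWordChar("\u00e9")) // false
	fmt.Println(isUnicodeLetter("\u00e9")) // true
}
```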
Since é
is not ASCII, it counts as \W, so \b matches between it and the adjacent "v" or "s". You can, however, build your own \b
replacement by combining other regex constructs. Here is a simple solution that handles the cases given in the question, though you may want to make the matching more thorough:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// MustCompile panics on an invalid pattern instead of silently
	// discarding the error.
	r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
	fmt.Println(r.MatchString("vis")) // added this case
	fmt.Println(r.MatchString("re vis e"))
	fmt.Println(r.MatchString("revise"))
	fmt.Println(r.MatchString("révisé"))
}
Running this gives:
true
true
false
false
What this solution does is essentially replace each \b
with a non-capturing group, (?:\A|\s) before the word and (?:\s|\z) after it
, meaning "start of string or whitespace" and "whitespace or end of string" respectively. Unlike \b, \s consumes the character it matches. You may want to allow other delimiters here, such as punctuation.
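One way to cover punctuation without listing every delimiter is to treat anything that is not a Unicode letter, mark, digit, or underscore as a boundary. The following is a sketch under that assumption, not the only possible definition of a "word character"; the names wordVis and hasWordVis are made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// wordVis matches "vis" only when it is not directly adjacent to a
// Unicode letter (\p{L}), combining mark (\p{M}), digit (\p{N}), or
// underscore. This is an assumed definition of a word character.
var wordVis = regexp.MustCompile(
	`(?:\A|[^\p{L}\p{M}\p{N}_])(vis)(?:[^\p{L}\p{M}\p{N}_]|\z)`)

// hasWordVis reports whether s contains "vis" as a standalone word.
// (Hypothetical helper, named here for illustration only.)
func hasWordVis(s string) bool { return wordVis.MatchString(s) }

func main() {
	inputs := []string{"vis", "re vis e", "re,vis.", "revise", "r\u00e9vis\u00e9"}
	for _, s := range inputs {
		fmt.Println(s, hasWordVis(s))
	}
}
```

The first three inputs report true and the last two report false. Because the delimiters are consumed rather than merely asserted (Go's regexp package has no lookaround), FindAllString on an input like "vis vis" would find only the first occurrence; MatchString is not affected by this.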