I cannot get regexp.FindSubmatch to match in certain simple cases. For example, the following code works properly:
assigned := regexp.MustCompile(`\x7f`)
group := assigned.FindSubmatch([]byte{0x7f})
fmt.Println(group)
(in playground it prints [[127]])
But if I change the byte to 0x80, it no longer matches. Why?
As mentioned in the package documentation:
All characters are UTF-8-encoded code points.
So the regular expression \x80 does not match the byte value 0x80, but rather the UTF-8 representation of the character U+0080. This is evident if we change your test program to:
func main() {
	assigned := regexp.MustCompile(`\x80`)
	// 0xc2 0x80 is the UTF-8 encoding of U+0080.
	group := assigned.FindSubmatch([]byte{1, 2, 3, 0xc2, 0x80})
	fmt.Println(group)
}
We now get a match for the two-byte sequence [[194 128]], which is the UTF-8 encoding of the character in question.
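To see why 0x7f worked while 0x80 did not, it may help to print the UTF-8 encoding of both code points. This is a minimal sketch; the helper name utf8Bytes is made up for illustration and is not part of any package.

```go
package main

import "fmt"

// utf8Bytes is a hypothetical helper that returns the UTF-8 encoding
// of a single code point.
func utf8Bytes(r rune) []byte {
	return []byte(string(r))
}

func main() {
	// U+007F is ASCII, so its encoding is the single byte 0x7f.
	fmt.Println(utf8Bytes(0x7f)) // prints [127]
	// U+0080 is outside ASCII, so its encoding takes two bytes.
	fmt.Println(utf8Bytes(0x80)) // prints [194 128]
}
```

For code points up to U+007F the encoding and the byte value coincide, which is why the original \x7f example matched a raw 0x7f byte.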
There is no way to switch the regexp package into a binary mode, so you will either need to convert your inputs to valid UTF-8, or use a different package to match your data.