Short version: The program below prints 3, which makes sense because in Go a string is essentially a slice of bytes, and it takes three bytes to represent this character in UTF-8. How can I get len and the regexp functions to work in terms of characters, not bytes?
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println(len("ウ"))                    // prints 3 (bytes)
	fmt.Println(utf8.RuneCountInString("ウ")) // prints 1 (characters)
}
Background:
I'm saving text into the GAE datastore using JDO (Java).
Then I'm processing the text using Go; specifically, I'm using regexp.FindStringIndex and saving the index to the datastore.
Then back in Java land I send the unmodified text and the index to the GWT client via JSON.
Somewhere along the way the indexes are 'shifting', so by the time the text is on the client, they are off.
The issue seems to be character encoding: I assume Java and Go count the indexes differently (UTF-8 bytes versus characters). I see references to runes in the regexp package.
I think I can either make regexp.FindStringIndex return character indexes in Go, or make the GWT client understand the UTF-8 byte indexes.
Any suggestions? I should be using UTF-8 in case I need to internationalize the app in the future, right?
Thanks
EDIT:
Also when I was finding the index using Java on the server things just worked.
On the client (GWT) I'm using text.substring(start,end)
TEST:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// FindStringIndex returns byte offsets; [1] is the end of the match.
	fmt.Print(regexp.MustCompile(`a`).FindStringIndex("ウィキa")[1])
}
The code outputs 10, not 4.
The plan is to get FindStringIndex to return 4, any ideas?
Update 2: Position Conversion
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

func main() {
	s := "ab日aba本語ba"
	byteIndex := regexp.MustCompile(`a`).FindAllStringIndex(s, -1)
	fmt.Println(byteIndex) // [[0 1] [5 6] [7 8] [15 16]]
	offset := 0
	// posMap maps each rune's starting byte position to its offset
	// (byte position minus char position).
	posMap := make([]int, len(s))
	for pos, char := range s {
		fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
		posMap[pos] = offset
		offset += utf8.RuneLen(char) - 1
	}
	fmt.Println("posMap =", posMap)
	for pos, value := range byteIndex {
		fmt.Printf("pos:%d value:%d subtract %d\n", pos, value, posMap[value[0]])
		// The end index reuses the start's offset, which is only
		// right when the matched text itself is single-byte.
		value[1] -= posMap[value[0]]
		value[0] -= posMap[value[0]]
	}
	fmt.Println(byteIndex) // [[0 1] [3 4] [5 6] [9 10]]
}
* Update 2 *
lastPos := -1
for pos, char := range s {
	offset += pos - lastPos - 1
	fmt.Printf("character %c starts at byte position %d, has an offset of %d, and a char position of %d.\n", char, pos, offset, pos-offset)
	posMap[pos] = offset
	lastPos = pos
}
As you've probably gathered, Go and Java treat strings differently. In Java, a string is a sequence of UTF-16 code units (chars), so for most text an index counts characters; in Go, a string is a sequence of bytes. Text manipulation functions in Go understand UTF-8 code points when necessary, but since the string is represented as bytes, the indexes they return and work with are byte indexes, not character indexes.
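To make the difference concrete, here is a small sketch that measures the same string three ways: in bytes (the unit Go's len and regexp indexes use), in runes (code points), and in UTF-16 code units (the unit Java's String.length and substring use). The helper names byteCount, runeCount and utf16Count are made up for illustration. For text inside the Basic Multilingual Plane the last two agree; they differ only for characters that need a surrogate pair, such as most emoji.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// byteCount reports the length of s in bytes, the unit used by
// Go's len and by the regexp index functions.
func byteCount(s string) int { return len(s) }

// runeCount reports the number of code points (runes) in s.
func runeCount(s string) int { return utf8.RuneCountInString(s) }

// utf16Count reports the number of UTF-16 code units in s, the
// unit used by Java's String.length and substring.
func utf16Count(s string) int { return len(utf16.Encode([]rune(s))) }

func main() {
	for _, s := range []string{"a", "ウ", "ウィキa", "😀"} {
		fmt.Printf("%q: bytes=%d runes=%d utf16=%d\n",
			s, byteCount(s), runeCount(s), utf16Count(s))
	}
}
```

If the text can contain characters outside the Basic Multilingual Plane, indexes meant for Java's substring would need to be counted in UTF-16 code units rather than runes.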
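As you observe in the comments, regexp can also match against a RuneReader via FindReaderIndex. strings.Reader provides an implementation of RuneReader, so you can use strings.NewReader to wrap a string in a RuneReader. Be aware, though, that the regexp documentation describes the locations FindReaderIndex returns as byte offsets in the input stream, so verify the indexes it produces before treating them as character positions.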
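Another option is to take the substring you want the length of in characters and pass it to utf8.RuneCountInString, which returns the number of runes (characters) in a UTF-8 string. (utf8.RuneLen, by contrast, returns the number of bytes needed to encode a single rune.) For a match that starts at byte index i in s, utf8.RuneCountInString(s[:i]) is its character index. Each call rescans the prefix, so with many matches a single pass over the string, like the position map in the question, is probably more efficient.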