I have a large string in Go and I'd like to split it up into smaller chunks. Each chunk should be at most 10kb. The chunks should be split on runes (not in the middle of a rune).
What is the idiomatic way to do this in Go? Should I just be looping over the range of the string bytes? Am I missing some helpful stdlib packages?
The idiomatic way to split a string (or any slice or array) is by using slicing. Since you want to split by rune you'd have to loop through the entire string since you don't know in advance how many bytes each slice will contain.
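slices := []string{}
count := 0
lastIndex := 0
for i := range longString {
	// i is the byte index at which the current rune starts.
	if count == 10000 {
		slices = append(slices, longString[lastIndex:i])
		lastIndex = i
		count = 0
	}
	count++
}
// Append whatever is left after the last full chunk.
if lastIndex < len(longString) {
	slices = append(slices, longString[lastIndex:])
}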
Warning: I have not run or tested this code, but it conveys the general principles. Looping over a string with range iterates over the runes, not the bytes, automatically decoding the UTF-8 for you. The slice operator [] represents your new strings as subslices of longString, which means that no bytes from the string need to be copied.
Note that i is the byte index in the string and may be incremented by more than 1 in each loop iteration.
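To make the point about byte indexes concrete, here is a minimal sketch. The helper name runeOffsets and the sample string are made up for illustration; the only thing it relies on is how range behaves on a string.

```go
package main

import "fmt"

// runeOffsets is a hypothetical helper: it returns the byte index at which
// each rune of s starts, exactly as reported by a range loop over the string.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	s := "aæ€b" // runes of 1, 2, 3 and 1 bytes
	for i, r := range s {
		fmt.Printf("byte index %d: %c\n", i, r)
	}
	fmt.Println(runeOffsets(s), len(s)) // prints [0 1 3 6] 7
}
```

The index jumps from 1 to 3 and from 3 to 6 because æ and € take two and three bytes respectively.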
EDIT:
Sorry, I didn't see you wanted to limit the number of bytes, not Unicode code points. You can implement that as well relatively easily.
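slices := []string{}
lastIndex := 0 // byte index where the current chunk starts
lastI := 0     // byte index where the previous rune starts
for i := range longString {
	if i-lastIndex > 10000 {
		slices = append(slices, longString[lastIndex:lastI])
		lastIndex = lastI
	}
	lastI = i
}
// The tail can exceed the limit by the width of the final rune.
if len(longString)-lastIndex > 10000 {
	slices = append(slices, longString[lastIndex:lastI])
	lastIndex = lastI
}
if lastIndex < len(longString) {
	slices = append(slices, longString[lastIndex:])
}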
A working example at play.golang.org, which also takes care of the leftover bytes at the end of the string.
Use utf8.RuneStart to scan backward for a rune boundary. Slice the string at the boundary.
var chunks []string
for len(s) > 10000 {
	i := 10000
	for i >= 10000-utf8.UTFMax && !utf8.RuneStart(s[i]) {
		i--
	}
	chunks = append(chunks, s[:i])
	s = s[i:]
}
if len(s) > 0 {
	chunks = append(chunks, s)
}
Using this approach, the application examines only a few bytes at each chunk boundary instead of the entire string.
The code is written to guarantee progress when the string is not a valid UTF-8 encoding. You might want to handle this situation as an error or split the string in a different way.
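If you prefer the error route, one possible sketch follows. The function name splitChunks, its max parameter, and the error messages are assumptions made for illustration, not part of the snippet above. It reports an error only when no rune start is found within utf8.UTFMax bytes of a boundary; it does not validate the rest of the string.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// splitChunks is a hypothetical wrapper around the RuneStart scan. It splits
// s into chunks of at most max bytes, cutting only where a rune starts.
func splitChunks(s string, max int) ([]string, error) {
	if max < utf8.UTFMax {
		return nil, errors.New("max must be at least utf8.UTFMax")
	}
	var chunks []string
	for len(s) > max {
		// Walk back from the limit until a byte that starts a rune is found.
		i := max
		for i > max-utf8.UTFMax && !utf8.RuneStart(s[i]) {
			i--
		}
		// Valid UTF-8 never has four continuation bytes in a row, so
		// reaching this index means the input is malformed here.
		if i <= max-utf8.UTFMax {
			return nil, errors.New("invalid UTF-8 near chunk boundary")
		}
		chunks = append(chunks, s[:i])
		s = s[i:]
	}
	if len(s) > 0 {
		chunks = append(chunks, s)
	}
	return chunks, nil
}

func main() {
	chunks, err := splitChunks("aæbøc", 4)
	fmt.Println(chunks, err)

	_, err = splitChunks("\x80\x80\x80\x80\x80\x80", 4)
	fmt.Println(err)
}
```

If you would rather reject bad input up front, calling utf8.ValidString before splitting is an alternative, at the cost of scanning the whole string once.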
Check out this code:
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func init() {
	// Only needed before Go 1.20; newer versions seed the global generator
	// automatically and mark rand.Seed as deprecated.
	rand.Seed(time.Now().UnixNano())
}

var alphabet = []rune{
	'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p',
	'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'æ', 'ø', 'å', 'A', 'B', 'C',
	'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S',
	'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'Æ', 'Ø', 'Å',
}

func randomString(n int) string {
	b := make([]rune, n)
	for k := range b {
		b[k] = alphabet[rand.Intn(len(alphabet))]
	}
	return string(b)
}
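
const (
	chunkSize int = 100

	// Apply each mask to a byte b with & and compare with the value shown.
	lead4Mask byte = 0xF8 // lead byte of a 4-byte rune: b&lead4Mask == 0xF0
	lead3Mask byte = 0xF0 // lead byte of a 3-byte rune: b&lead3Mask == 0xE0
	lead2Mask byte = 0xE0 // lead byte of a 2-byte rune: b&lead2Mask == 0xC0
	lead1Mask byte = 0x80 // single-byte (ASCII) rune:   b&lead1Mask == 0x00
	trailMask byte = 0xC0 // continuation byte:          b&trailMask == 0x80
)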
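
// longestPrefix returns the length in bytes of the longest prefix of s that
// is at most n bytes long and ends on a rune boundary. It requires len(s) > n
// and assumes s is valid UTF-8.
func longestPrefix(s string, n int) int {
	// If s[n] is not a continuation byte it starts a new rune, so the first
	// n bytes already end on a boundary.
	if (s[n] & trailMask) != 0x80 {
		return n
	}
	for i := n - 1; i >= 0; i-- {
		if (s[i] & lead1Mask) == 0x00 {
			return i + 1 // ASCII byte: cut right after it
		}
		if (s[i] & trailMask) != 0x80 {
			return i // lead byte of a multi-byte rune: cut right before it
		}
	}
	panic("no rune boundary found: input is not valid UTF-8")
}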

func main() {
	s := randomString(100000)
	for len(s) > chunkSize {
		cut := longestPrefix(s, chunkSize)
		fmt.Println(s[:cut])
		s = s[cut:]
	}
	fmt.Println(s)
}
I'm using the Danish/Norwegian alphabet to generate a random string of 100000 runes.
Then, the "magic" lies in longestPrefix. To help you with the bit-masking part, refer to the following graphic:
The program prints out the respective longest possible chunks <= chunkSize, one per line.
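The bit patterns that the masks test for can also be checked directly in code. This is a minimal sketch; byteKind and its return labels are hypothetical names used only for illustration, and the patterns in the comments are the standard UTF-8 byte layouts.

```go
package main

import "fmt"

// byteKind is a hypothetical helper that classifies a single byte of a
// UTF-8 string by masking its high bits.
func byteKind(b byte) string {
	switch {
	case b&0x80 == 0x00:
		return "ascii" // 0xxxxxxx: a complete single-byte rune
	case b&0xC0 == 0x80:
		return "trail" // 10xxxxxx: continuation byte
	case b&0xE0 == 0xC0:
		return "lead2" // 110xxxxx: first byte of a 2-byte rune
	case b&0xF0 == 0xE0:
		return "lead3" // 1110xxxx: first byte of a 3-byte rune
	case b&0xF8 == 0xF0:
		return "lead4" // 11110xxx: first byte of a 4-byte rune
	}
	return "invalid" // 11111xxx never appears in UTF-8
}

func main() {
	s := "aø"
	for i := 0; i < len(s); i++ {
		fmt.Printf("%#x %s\n", s[i], byteKind(s[i]))
	}
}
```

A chunk may end right after an "ascii" byte or right before any "lead" byte, which is the rule longestPrefix applies while walking backward.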