I want to read a large text file (nearly 3 GB) and split it into blocks of n symbols each. I tried reading the file and splitting it with runes, but that takes a lot of memory.
func SplitSubN(s string, n int) []string {
    sub := ""
    subs := []string{}
    runes := bytes.Runes([]byte(s))
    l := len(runes)
    for i, r := range runes {
        sub = sub + string(r)
        if (i+1)%n == 0 {
            subs = append(subs, sub)
            sub = ""
        } else if (i + 1) == l {
            subs = append(subs, sub)
        }
    }
    return subs
}
I suppose it can be done in a smarter way, such as phased reading of blocks of a certain length from the file, but I don't know how to do that correctly.
Scan for rune start bytes and split based on that. This eliminates all allocations within the function except for the allocation of the result slice.
func SplitSubN(s string, n int) []string {
    if len(s) == 0 {
        return nil
    }
    m := 0 // number of rune starts seen past the first byte
    i := 0 // start of the current chunk
    j := 1
    var result []string
    for ; j < len(s); j++ {
        if utf8.RuneStart(s[j]) {
            // m+1 complete runes precede index j; cut every n runes.
            if (m+1)%n == 0 {
                result = append(result, s[i:j])
                i = j
            }
            m++
        }
    }
    if j > i {
        result = append(result, s[i:j]) // final, possibly short, chunk
    }
    return result
}
The API specified in the question requires that the application allocate memory when converting the []byte read from the file to a string. This allocation can be avoided by changing the function to work on bytes:
func SplitSubN(s []byte, n int) [][]byte {
    if len(s) == 0 {
        return nil
    }
    m := 0 // number of rune starts seen past the first byte
    i := 0 // start of the current chunk
    j := 1
    var result [][]byte
    for ; j < len(s); j++ {
        if utf8.RuneStart(s[j]) {
            // m+1 complete runes precede index j; cut every n runes.
            if (m+1)%n == 0 {
                result = append(result, s[i:j])
                i = j
            }
            m++
        }
    }
    if j > i {
        result = append(result, s[i:j]) // final, possibly short, chunk
    }
    return result
}
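For completeness, here is one way the byte-based version might be used (my sketch, not part of the answer; "testfile" and the chunk length 1000 are placeholder values, and SplitSubN refers to the []byte variant above):

package main

import (
    "fmt"
    "os"
)

func main() {
    // Slurps the whole file into memory; fine for small files,
    // costly for a 3 GB input.
    data, err := os.ReadFile("testfile")
    if err != nil {
        panic(err)
    }
    for _, chunk := range SplitSubN(data, 1000) {
        fmt.Println(len(chunk), "bytes")
    }
}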
Both of these functions require that the application slurp the entire file into memory. I assume that's OK because the function in the question does as well. If you only need to process one chunk at a time, then the above code can be adapted to scan as the file is read incrementally.
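As a rough sketch of that adaptation (mine, under stated assumptions: streamChunks and the emit callback are hypothetical names), one can let bufio.Reader.ReadRune do the decoding, since it transparently handles runes that span the reader's internal read boundaries:

import (
    "bufio"
    "io"
    "os"
    "strings"
)

// streamChunks reads the file at path and calls emit with each chunk
// of n runes, without loading the whole file into memory.
func streamChunks(path string, n int, emit func(string)) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    r := bufio.NewReader(f)
    var b strings.Builder
    count := 0
    for {
        ch, _, err := r.ReadRune()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        b.WriteRune(ch)
        count++
        if count == n {
            emit(b.String())
            b.Reset()
            count = 0
        }
    }
    if b.Len() > 0 {
        emit(b.String()) // trailing chunk shorter than n
    }
    return nil
}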
Actually, the most interesting part is not parsing the chunks themselves but handling characters that overlap chunk boundaries. For example, if you read from the file in chunks of N bytes, the last multi-byte character may be read only partially, and its remainder arrives with the next read.
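To see the issue concretely, consider a multi-byte character cut by a chunk boundary; utf8.FullRune reports whether the bytes at hand form a complete rune (a small illustration of mine, not part of the solution below):

package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    word := []byte("世")                  // 3 bytes: e4 b8 96
    fmt.Println(utf8.FullRune(word[:2])) // false: boundary split the rune; keep these bytes
    fmt.Println(utf8.FullRune(word))     // true: the rune is complete and safe to decode
}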
Here is a solution that reads a text file in chunks of a given size and handles characters overlapping chunk boundaries, delivering the results asynchronously over a channel:
package main

import (
    "fmt"
    "io"
    "log"
    "os"
    "unicode/utf8"
)

func main() {
    data, errs := ReadInChunks("testfile", 1024*16)
    completed := false
    for !completed {
        select {
        case next, ok := <-data:
            if !ok {
                completed = true // channel closed: reading finished
            } else {
                fmt.Print(string(next))
            }
        case e := <-errs:
            if e != nil {
                log.Fatalf("error: %s", e)
            }
        }
    }
}
func ReadInChunks(path string, readChunkSize int) (chan []rune, chan error) {
    readChannel := make(chan []rune)
    errorChannel := make(chan error)
    onDone := func() {
        close(readChannel)
        close(errorChannel)
    }
    onError := func(err error) {
        errorChannel <- err
        onDone()
    }
    go func() {
        f, err := os.Open(path) // reports a meaningful error if the file does not exist
        if err != nil {
            onError(err)
            return
        }
        defer f.Close()
        readBuf := make([]byte, readChunkSize)
        remainder := 0 // bytes of a partially read rune carried over from the previous read
        for {
            read, err := f.Read(readBuf[remainder:])
            if read > 0 {
                total := remainder + read
                rs, parsed := runes(readBuf[:total])
                // Move the unparsed tail (an incomplete rune) to the front
                // of the buffer; the next read will complete it.
                remainder = total - parsed
                copy(readBuf, readBuf[parsed:total])
                if len(rs) > 0 {
                    readChannel <- rs
                }
            }
            if err == io.EOF {
                onDone()
                return
            }
            if err != nil {
                onError(err)
                return
            }
        }
    }()
    return readChannel, errorChannel
}
// runes decodes as many complete runes as possible from the buffer.
// It returns the decoded runes and the number of bytes consumed; an
// incomplete rune at the end of the buffer is left for the next read.
func runes(buf []byte) ([]rune, int) {
    t := make([]rune, 0, utf8.RuneCount(buf))
    read := 0
    for len(buf) > 0 {
        if !utf8.FullRune(buf) {
            break // partial rune at the end of the chunk
        }
        r, size := utf8.DecodeRune(buf)
        t = append(t, r)
        read += size
        buf = buf[size:]
    }
    return t, read
}
It can be greatly simplified if the file is ASCII.
Alternatively, if you need to support Unicode, you can work with UTF-32 (which has a fixed length) or with UTF-16 (which you can also treat as fixed-size if you don't need characters outside the Basic Multilingual Plane).
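For the ASCII case, the simplification might look like the sketch below (my illustration; readASCIIChunks and emit are hypothetical names). Every character is exactly one byte, so fixed-size byte chunks can never split a character and no remainder handling is required:

import (
    "io"
    "os"
)

// readASCIIChunks streams fixed-size chunks from an ASCII-only file.
// One byte == one character, so no chunk can end mid-character.
func readASCIIChunks(path string, chunkSize int, emit func([]byte)) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()

    buf := make([]byte, chunkSize)
    for {
        n, err := io.ReadFull(f, buf)
        if n > 0 {
            emit(buf[:n]) // emit must copy if it retains the slice
        }
        if err == io.EOF || err == io.ErrUnexpectedEOF {
            return nil // done: final (possibly short) chunk already emitted
        }
        if err != nil {
            return err
        }
    }
}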