I have some large JSON files I want to parse, and I want to avoid loading all of the data into memory at once. I'd like a function/loop that can return each character one at a time.
I found this example for iterating over words in a string, and the ScanRunes function in the bufio package looks like it could return a character at a time. I also had the ReadRune function from bufio mostly working, but that felt like a pretty heavy approach.
I compared 3 approaches. All used a loop to pull content from either a bufio.Reader or a bufio.Scanner:

1. .ReadRune on a bufio.Reader, checking for errors from each call to .ReadRune.
2. A bufio.Scanner, after calling .Split(bufio.ScanRunes) on the scanner. Called .Scan and .Bytes on each iteration, checking the .Scan call for errors.
3. Same as 2, but pulled strings out of the bufio.Scanner instead of bytes, using .Text. Instead of joining a slice of runes with string(runes), I joined a slice of strings with strings.Join(strs, "") to form the final blobs of text.

The timing for 10 runs of each on a 23 MB JSON file was:

1. 0.65 s
2. 2.40 s
3. 0.97 s
So it looks like ReadRune is not too bad after all. It also makes for a smaller, less verbose loop, because each rune is fetched in one operation (.ReadRune) instead of two (.Scan and .Bytes).
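For concreteness, here is a minimal sketch of what approach 3 looks like (the input string and variable names are illustrative, not taken from the benchmark itself):

package main

import (
    "bufio"
    "fmt"
    "log"
    "strings"
)

func main() {
    s := bufio.NewScanner(strings.NewReader(`{"sample":"json string"}`))
    s.Split(bufio.ScanRunes) // each token is now a single rune

    var parts []string
    for s.Scan() {
        parts = append(parts, s.Text()) // one rune per iteration, as a string
    }
    if err := s.Err(); err != nil {
        log.Fatal(err)
    }
    fmt.Println(strings.Join(parts, "")) // reassemble the final blob of text
}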
Just read each rune one by one in the loop... See example
EDIT: Adding the code for posterity, in case the link ever dies:
package main

import (
    "bufio"
    "fmt"
    "io"
    "log"
    "strings"
)

var text = `
The quick brown fox jumps over the lazy dog #1.
Быстрая коричневая лиса перепрыгнула через ленивую собаку.
`

func main() {
    r := bufio.NewReader(strings.NewReader(text))
    for {
        if c, sz, err := r.ReadRune(); err != nil {
            if err == io.EOF {
                break
            } else {
                log.Fatal(err)
            }
        } else {
            fmt.Printf("%q [%d]\n", string(c), sz)
        }
    }
}
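Since the question is about large files rather than an in-memory string, the same loop works unchanged over a buffered file reader. A minimal sketch, assuming a hypothetical file named data.json:

package main

import (
    "bufio"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("data.json") // hypothetical file name
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := bufio.NewReader(f)
    for {
        c, _, err := r.ReadRune()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        _ = c // process the rune here
    }
}

bufio.NewReader keeps only a small internal buffer in memory (4 KB by default), so this scales to arbitrarily large files.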
This code reads runes from the input. No cast is necessary, and it is iterator-like:
package main

import (
    "bufio"
    "fmt"
    "log"
    "strings"
)

func main() {
    in := `{"sample":"json string"}`
    s := bufio.NewScanner(strings.NewReader(in))
    s.Split(bufio.ScanRunes)
    for s.Scan() {
        fmt.Println(s.Text())
    }
    // Scan returns false on both EOF and error; Err distinguishes the two.
    if err := s.Err(); err != nil {
        log.Fatal(err)
    }
}
If it's just about the memory size: in the upcoming release (really soon) there is going to be a token-style enhancement of the json decoder; you can see it here.
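For reference, that enhancement landed as Decoder.Token in encoding/json (Go 1.5). A minimal sketch of the streaming style it enables, with an illustrative input:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "strings"
)

func main() {
    in := `{"sample":"json string","nums":[1,2,3]}`
    dec := json.NewDecoder(strings.NewReader(in))

    for {
        // Token returns one JSON token (delimiter, string, number,
        // bool, or nil) at a time, so the whole document never has
        // to be held in memory.
        t, err := dec.Token()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%T: %v\n", t, t)
    }
}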