I'm reading in and at the same time parsing (decoding) a file in a custom format, which is compressed with zlib. My question is how can I efficiently uncompress and then parse the uncompressed content without growing the slice? I would like to parse it whilst reading it into a reusable buffer.
This is for a speed-sensitive application and so I'd like to read it in as efficiently as possible. Normally I would just ioutil.ReadAll
and then loop again through the data to parse it. This time I'd like to parse it as it's read, without having to grow the buffer into which it is read, for maximum efficiency.
Basically I'm thinking that if I can find a buffer of the perfect size then I can read into this, parse it, and then write over the buffer again, then parse that, etc. The issue here is that the zlib reader appears to read an arbitrary number of bytes each time Read(b)
is called; it does not fill the slice. Because of this I don't know what the perfect buffer size would be. I'm concerned that it might break up some of the data that I wrote into two chunks, making it difficult to parse because one say uint64 could be split from into two reads and therefore not occur in the same buffer read - or perhaps that can never happen and it's always read out in chunks of the same size as were originally written?
f.Write(b []byte)
is it possible that this same data could be split into two reads when reading back the compressed data (meaning I will have to have a history during parsing), or will it always come back in the same read?OK, so I figured this out in the end using my own implementation of a reader.
Basically the struct looks like this:
type reader struct {
at int
n int
f io.ReadCloser
buf []byte
}
This can be attached to the zlib reader:
// Open file for reading
fi, err := os.Open(filename)
if err != nil {
return nil, err
}
defer fi.Close()
// Attach zlib reader
r := new(reader)
r.buf = make([]byte, 2048)
r.f, err = zlib.NewReader(fi)
if err != nil {
return nil, err
}
defer r.f.Close()
Then x number of bytes can be read straight out of the zlib reader using a function like this:
mydata := r.readx(10)
func (r *reader) readx(x int) []byte {
for r.n < x {
copy(r.buf, r.buf[r.at:r.at+r.n])
r.at = 0
m, err := r.f.Read(r.buf[r.n:])
if err != nil {
panic(err)
}
r.n += m
}
tmp := make([]byte, x)
copy(tmp, r.buf[r.at:r.at+x]) // must be copied to avoid memory leak
r.at += x
r.n -= x
return tmp
}
Note that I have no need to check for EOF because I my parser should stop itself at the right place.
You can wrap your zlib reader in a bufio reader, then implement a specialized reader on top that will rebuild your chunks of data by reading from the bufio reader until a full chunk is read. Be aware that bufio.Read calls Read at most once on the underlying Reader, so you need to call ReadByte in a loop. bufio will however take care of the unpredictable size of data returned by the zlib reader for you.
If you do not want to implement a specialized reader, you can just go with a bufio reader and read as many bytes as needed with ReadByte() to fill a given data type. The optimal buffer size is at least the size of your largest data structure, up to whatever you can shove into memory.
If you read directly from the zlib reader, there is no guarantee that your data won't be split between two reads.
Another, maybe cleaner, solution is to implement a writer for your data, then use io.Copy(your_writer, zlib_reader).