We have a big dataset: a couple dozen CSV files, ~130 GB each. We need to emulate SQL queries over these CSV tables.
When reading a 1.1 GB test file with encoding/csv, the program allocates 526 GB of virtual memory. Why? Does csv.Reader
work like a generator when we call the reader.Read()
method, or does it keep the rows in memory?
Full code is available after code review.
UPD
We read the file like this:
rf, err := os.Open(input_file)
if err != nil {
    log.Fatalf("Error: %s", err)
}
r := csv.NewReader(rf)
for {
    record, err := r.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    _ = record
}
It fails on the line record, err := r.Read()
with an out-of-memory error.
UPD2 A heap-profile snapshot taken during the read:
2731.44MB 94.63% 94.63% 2731.44MB 94.63% encoding/csv.(*Reader).parseRecord
151MB 5.23% 99.86% 2885.96MB 100% main.main
0 0% 99.86% 2731.44MB 94.63% encoding/csv.(*Reader).Read
0 0% 99.86% 2886.49MB 100% runtime.goexit
0 0% 99.86% 2886.49MB 100% runtime.main
Most likely the line breaks aren't being detected and it's reading everything as a single record.
https://golang.org/src/encoding/csv/reader.go?s=4071:4123#L124
If you follow the code to line 210, you'll see it looks for '\r'.
Often I see line breaks written as a bare \r because some system exported the file thinking it was being Windows-smart, when in fact it's wrong: the correct Windows line break is \r\n.
Alternatively, you can write a custom Scanner
that will delimit the lines for you using whatever terminator your input actually uses, and use it as the io.Reader
input for your csv.Reader
. For example, to handle the invalid bare \r I mentioned above.