Go. Working with huge CSV files

We have a big dataset: a couple of dozen CSV files, each around 130 GB. We must emulate SQL queries over these CSV tables.

When we read a test table with encoding/csv from a 1.1 GB test file, the program allocates 526 GB of virtual memory. Why? Does csv.Reader work like a generator when we use the reader.Read() method, or does it keep rows in memory?
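
A minimal sketch of the kind of streaming loop we expected to be allocation-friendly (the file name test.csv, and the ReuseRecord hint, which exists since Go 1.9, are assumptions here, not our real code):

package main

import (
    "encoding/csv"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("test.csv") // hypothetical file name
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := csv.NewReader(f)
    r.ReuseRecord = true // Read reuses its backing slice between calls (Go 1.9+)
    for {
        record, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        _ = record // process here; do not retain record across iterations
    }
}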

Full code after code review.

UPD

We read the file like this:

rf, err := os.Open(input_file)
if err != nil {
    log.Fatalf("Error: %s", err)
}
defer rf.Close()
r := csv.NewReader(rf)
for {
    record, err := r.Read()
    if err == io.EOF {
        break
    }
    if err != nil {
        log.Fatal(err)
    }
    _ = record // process the record here
}

It falls over on the line record, err := r.Read() with an out-of-memory error.

UPD2: a snapshot of memory taken during the read:

 2731.44MB 94.63% 94.63%  2731.44MB 94.63%  encoding/csv.(*Reader).parseRecord
     151MB  5.23% 99.86%  2885.96MB   100%  main.main
         0     0% 99.86%  2731.44MB 94.63%  encoding/csv.(*Reader).Read
         0     0% 99.86%  2886.49MB   100%  runtime.goexit
         0     0% 99.86%  2886.49MB   100%  runtime.main

Most likely the line breaks aren't being detected, and it's reading everything as a single record.
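
One quick way to confirm this is to dump the first bytes of the file and inspect the terminator directly. A minimal sketch (test.csv is a placeholder for your file):

package main

import (
    "fmt"
    "log"
    "os"
)

func main() {
    f, err := os.Open("test.csv") // placeholder path
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Quoting the bytes makes the line terminator visible in the output.
    buf := make([]byte, 200)
    n, err := f.Read(buf)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%q\n", buf[:n])
}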

https://golang.org/src/encoding/csv/reader.go?s=4071:4123#L124

If you follow the code to line 210, you'll see that it looks for '\n'.

Oftentimes I see line breaks written as '\n\r' when some system exported the data, its authors thinking they were being Windows-smart, when in fact it's wrong. The correct Windows line break is '\r\n'.
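
To see the difference in practice, here is a small self-contained sketch (the two-row samples are made up) that feeds each terminator variant through encoding/csv; run it and compare how the reversed variant parses:

package main

import (
    "encoding/csv"
    "fmt"
    "strings"
)

func main() {
    // Made-up two-row samples: correct Windows, Unix, and the reversed variant.
    samples := []string{"a,b\r\nc,d", "a,b\nc,d", "a,b\n\rc,d"}
    for _, s := range samples {
        r := csv.NewReader(strings.NewReader(s))
        records, err := r.ReadAll()
        fmt.Printf("%q -> records=%q err=%v\n", s, records, err)
    }
}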

Alternatively, you can write a custom Scanner that will delimit the lines for you using whatever terminator your input actually has, and use its output as the io.Reader input for your csv.Reader, for example to handle the invalid '\n\r' I mentioned above.
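
One possible sketch of that approach (the sample input and the pipe-based bridge are my assumptions): a bufio.SplitFunc splits on the reversed "\n\r", and each scanned line is re-emitted with a normal "\n" so csv.Reader can consume it.

package main

import (
    "bufio"
    "bytes"
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "strings"
)

// splitOnNR is a bufio.SplitFunc that treats the reversed "\n\r"
// sequence as the line terminator.
func splitOnNR(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.Index(data, []byte("\n\r")); i >= 0 {
        return i + 2, data[:i], nil // line found; skip the 2-byte terminator
    }
    if atEOF {
        return len(data), data, nil // final line has no terminator
    }
    return 0, nil, nil // request more data
}

func main() {
    // Made-up input that uses the invalid "\n\r" terminator.
    input := strings.NewReader("a,b,c\n\r1,2,3\n\r4,5,6")

    // Bridge the scanner to csv.Reader through a pipe, rewriting
    // each terminator to a plain "\n".
    pr, pw := io.Pipe()
    go func() {
        sc := bufio.NewScanner(input)
        sc.Split(splitOnNR)
        for sc.Scan() {
            if _, err := pw.Write(sc.Bytes()); err != nil {
                pw.CloseWithError(err)
                return
            }
            if _, err := pw.Write([]byte{'\n'}); err != nil {
                pw.CloseWithError(err)
                return
            }
        }
        pw.CloseWithError(sc.Err()) // a nil error closes the pipe with io.EOF
    }()

    r := csv.NewReader(pr)
    for {
        record, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(record)
    }
}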