My Go CSV processing routine follows the encoding/csv package example almost exactly:
func processCSV(path string) {
	file := utils.OpenFile(path)
	reader := csv.NewReader(file)
	reader.LazyQuotes = true
	cs := []*Collision{} // defined elsewhere
	for {
		line, err := reader.Read()
		// Kill processing if we're at EOF
		if err == io.EOF {
			break
		}
		c := get(line) // defined elsewhere
		cs = append(cs, c)
	}
	// Do other stuff...
}
The code works great until it encounters a malformed (?) line of CSV, which generally looks something like this:
item1,item2,"item3,"has odd quoting"","item4",item5
The csvReader.LazyQuotes = true option doesn't seem to offer enough tolerance to read this line as I need it.
My question is this: can I ask the csv reader for the original line so that I can "massage" it to pull out what I need? The files I'm working with are moderately large (~150mb) and I'm not sure I want to re-do them, especially as only a few lines per file have such problems.
Thanks for any tips!
As far as I can tell, encoding/csv doesn't provide any such functionality, so you can either look for a 3rd-party csv package that does, or implement a solution yourself.
If you want to go the DIY route, I can offer you a tip; whether it's a good tip that you should implement is up to you. You could implement an io.Reader that wraps your file and tracks the last line read. Then, every time you encounter an error because of malformed CSV, you can use your reader to reread that line, massage it, add it to the results, and have the loop continue as if nothing happened.
Here's an example of how your processCSV would change:
func processCSV(path string) {
	file := utils.OpenFile(path)
	myreader := NewMyReader(file)
	reader := csv.NewReader(myreader)
	reader.LazyQuotes = true
	cs := []*Collision{} // defined elsewhere
	for {
		line, err := reader.Read()
		// Kill processing if we're at EOF
		if err == io.EOF {
			break
		}
		// malformed csv
		if err != nil {
			// Just reread the last line; on the next iteration of
			// the loop myreader.Read should continue returning bytes
			// that come after this malformed line to the csv.Reader.
			l, err := myreader.CurrentLine()
			if err != nil {
				panic(err)
			}
			// massage the malformed csv line
			line = fixcsv(l)
		}
		c := get(line) // defined elsewhere
		cs = append(cs, c)
	}
	// Do other stuff...
}
Looking at the implementation of csv.Reader.Read, you cannot do what you are looking for with the csv package. It uses a package-private function, parseRecord(), which does the hard work. I think what you need is to write your own CSV reader that handles these cases, or simply to preprocess the file line by line so that malformed quoting is repaired into a form the csv package can handle correctly (for example, escaping a stray " by doubling it to "", which is the escaping convention encoding/csv understands).
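A minimal sketch of the preprocessing route, under the assumption that no legitimate record in these files spans multiple lines: parse each physical line on its own and hand anything that errors or yields the wrong number of fields to a repair hook. The fix function here is hypothetical; what the repair should actually do depends on the data.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"strings"
)

// preprocess copies input line by line, passing any line that does not
// parse into wantFields fields to fix. Sketch only: it assumes no quoted
// field legitimately contains a newline.
func preprocess(input string, wantFields int, fix func(string) string) string {
	var out strings.Builder
	sc := bufio.NewScanner(strings.NewReader(input))
	for sc.Scan() {
		line := sc.Text()
		r := csv.NewReader(strings.NewReader(line))
		r.LazyQuotes = true
		rec, err := r.Read()
		if err != nil || len(rec) != wantFields {
			line = fix(line) // hypothetical repair hook
		}
		out.WriteString(line)
		out.WriteByte('\n')
	}
	return out.String()
}

func main() {
	in := "h1,h2,h3,h4,h5\n" +
		`item1,item2,"item3,"has odd quoting"","item4",item5` + "\n"
	fixed := preprocess(in, 5, func(s string) string {
		return "<needs massaging> " + s // stand-in for real repair logic
	})
	fmt.Print(fixed)
}
```

The detection deliberately checks the field count as well as the error, because with LazyQuotes set an isolated malformed line can parse without error but with the wrong number of fields.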
I "solved" this problem using a hint from mkopriva and blatant copying from Go's CSV parsing code. If I read it right, Go's CSV parser is rather clever about what it considers a line. When I've written a naive CSV parser, I've split files on new lines, and then processed them. Go's parser is smarter, and includes the possibility that a quoted field might itself contain a new line. In those cases, my code would fail and theirs would work.
Feeding "lines" to Go's parser is a bit tricky, as it's reading through a stream looking for line-beginning-and-ending patterns and extracting fields along the way. What I did was hijack the code and add a variable that tracks the beginning and end of the stream that the code considers a line. My additions probably have problems, but seem to work for me. If it helps, here are the steps I took:
1) Copy the CSV source and paste into my project in its entirety.
2) Add a new field to type Reader struct {}:
type Reader struct {
	...
	// The i'th field starts at offset fieldIndexes[i] in lineBuffer.
	fieldIndexes []int
	CurrentLine []byte // Added struct field to hold onto the line
	...
}
3) In readRune(), capture bytes as they come in, like so:
func (r *Reader) readRune() (rune, error) {
	r1, _, err := r.r.ReadRune()
	// added: stores bytes as processed. Appending string(r1)... keeps
	// multibyte runes intact, where byte(r1) would truncate them.
	r.CurrentLine = append(r.CurrentLine, string(r1)...)
	...
}
4) in Read(), reset CurrentLine for each line, like so:
func (r *Reader) Read() (record []string, err error) {
	r.CurrentLine = []byte{} // added: reset line capturing
	...
}
With these items added, I can then grab the current line when there's a parsing error, as per mkopriva's suggestion:
...
if err != nil {
	line = fixCSV(csvReader.CurrentLine)
	continue
}
...