I want to do the same thing in Go as was asked here.
I'm parsing a huge log file line by line, deserializing each line into a struct. The data may come from any source (file, network, etc.), so my function receives an io.Reader. Since the file is huge, I want to split it among many goroutines.
I could have done this easily using io.Pipe etc. However, I need to split the file without cutting any line in the middle (for example, into two halves that each end on a line boundary), so that each goroutine receives its own io.Reader and works on a different part of the file.
Sometimes I also need to pass an io.MultiReader to my function. In that case I would do the same thing again, so it's not necessarily the same file (but mostly it is).
func scan(r io.Reader, pf ProcessFunc) {
    // need to split `r` here if `r` is:
    //   r.(io.ReadSeeker)

    // run goroutine #1 with 50% of the stream
    //   uses bufio.Scanner

    // run goroutine #2 with 50% of the stream
    //   uses bufio.Scanner

    // another goroutine is receiving the deserialized values
    // and sends them to the ProcessFunc for processing further
    // down the pipeline
}
Let's say the data is like this:
foo1 bar2
foo3 bar4
foo5 bar6
foo7 bar8
The goroutine #1 will get an io.Reader like so:
foo1 bar2
foo3 bar4
And the goroutine #2 will get an io.Reader like so:
foo5 bar6
foo7 bar8
But not like this:
o5 bar6 -> breaks the line in the second io.Reader
foo7 bar8
You've got a couple of options:
If you have seekable data, you can seek to the approximate split point and then scan forward to the next newline, so that you only ever split on line breaks.
Pass lines to the goroutines instead of io.Readers. Basically, each goroutine would have a channel, and the main routine would feed each line from the io.Reader into the channels.
Split the file beforehand with something like the split utility.