I need to read a block of n lines from a zip file as quickly as possible.
I'm a beginner in Go. For bash lovers, I want to do the same as this (getting a block of 500 lines between lines 199500 and 200000):
time query=$(zcat fake_contacts_200k.zip | sed '199500,200000!d')
real 0m0.106s
user 0m0.119s
sys 0m0.013s
Any idea is welcome.
Import archive/zip.
Open and read the archive file as shown in the example right there in the docs.
Note that in order to mimic the behaviour of zcat, you first have to check the length of the File field of the zip.ReadCloser instance returned by a call to zip.OpenReader, and fail if it is not equal to 1 — that is, when there are no files in the archive, or there are two or more files in it¹.
Note that you also have to check the error value returned by a call to zip.OpenReader for being equal to zip.ErrFormat, and if it is, you have to:

- close the zip.ReadCloser;
- check whether the file is gzip-formatted instead (step 4).

Take the first (and sole) File member and call Open on it.
You can then read the file's contents from the returned io.ReadCloser.
After reading, you need to call Close() on that instance and then close the zip file as well. That's all. ∎
If step (2) failed because the file did not have the zip format, you'd test whether it's gzip-formatted.
In order to do this, you do basically the same steps using the compress/gzip
package.
Note that contrary to the zip format, gzip does not provide file archival — it's merely a compressor, so there's no meta information on any files in the gzip stream, just the compressed data. (This fact is underlined by the difference in the names of the packages.)
If an attempt to open the same file as a gzip stream returns the gzip.ErrHeader error, you bail out; otherwise you read the data, after which you close the reader. That's all. ∎
To process just the specific lines from the decompressed data read from an io.Reader or io.ReadCloser, it's best to use bufio.Scanner — see the "Example (Lines)" in its documentation.
P.S.
Please read this essay thoroughly to try to make your next question better than this one.
¹ You might as well read all the files and interpret their contents as a contiguous stream — that would deviate from the behaviour of zcat, but it might be better. It really depends on your data.