Go: reading a block of lines in a zip file [closed]

I need to read a block of n lines from a zip file as quickly as possible.

I'm a beginner in Go. For bash lovers, I want to do the same as this (to get the block of roughly 500 lines between lines 199500 and 200000):

time query=$(zcat fake_contacts_200k.zip | sed '199500,200000!d')

real    0m0.106s
user    0m0.119s
sys     0m0.013s

Any idea is welcome.

  1. Import archive/zip.

  2. Open and read the archive file as shown in the example right there in the docs.

    • Note that in order to mimic the behaviour of zcat you have to first check the length of the File field of the zip.ReadCloser instance returned by a call to zip.OpenReader, and fail if it is not equal to 1, that is, if there are no files in the archive or there are two or more files in it¹.

    • Note that you have to check the error value returned by a call to zip.OpenReader for being equal to zip.ErrFormat, and if it's equal, you have to try to reinterpret the file as being gzip-formatted (step 4).

      There is nothing to close in that case: when zip.OpenReader fails with zip.ErrFormat, the *zip.ReadCloser it returns is nil.
  3. Take the first (and sole) File member and call Open on it.

    You can then read the file's contents from the returned io.ReadCloser.

    After reading, you need to call Close() on that instance and then close the zip file as well. That's all. ∎

  4. If step (2) failed because the file did not have the zip format, you'd test whether it's gzip-formatted.

    In order to do this, you do basically the same steps using the compress/gzip package.

    Note that contrary to the zip format, gzip does not provide file archival — it's merely a compressor, so there's no meta information on any files in the gzip stream, just the compressed data. (This fact is underlined by the difference in the names of the packages.)

    If an attempt to open the same file as a gzip stream returns the gzip.ErrHeader error, you bail out; otherwise you read the data and then close the reader. That's all. ∎

To process just the specific lines from the decompressed file, you'd need to

  1. Skip the lines before the first one to process.
  2. Process the lines up to and including the last one to process.
  3. Stop processing.

To interpret the data read from an io.Reader or io.ReadCloser, it's best to use bufio.Scanner — see the "Example (Lines)" there.

P.S.

Please read this essay thoroughly to try to make your next question better than this one.


¹ You might as well read all the files and interpret their contents as a contiguous stream — that would deviate from the behaviour of zcat but that might be better. It really depends on your data.