I am trying to compute the sha256 sum of a gzipped file in Go, but my output does not match that of the gzip command.
I have a function Compress that gzips the contents of an io.Reader, a file in my case.
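// Compress gzips everything read from r into an in-memory buffer.
func Compress(r io.Reader) (io.Reader, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := io.Copy(zw, r); err != nil {
		return nil, err
	}
	// Close flushes the remaining data and writes the gzip trailer.
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}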
Then I have a function Sum256 that computes the sha256 sum of a reader.
// Sum256 returns the SHA-256 digest of everything read from r.
func Sum256(r io.Reader) ([]byte, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return nil, err
	}
	return h.Sum(nil), nil
}
My main function opens a file, gzips it, then computes the sha256 sum of the zipped contents. The problem is that the output does not match that of the gzip command. The input file hello.txt contains a single line with the word hello with no newline at the end.
func main() {
	uncompressed, err := os.Open("hello.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer uncompressed.Close()
	sum, err := Sum256(uncompressed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x %s\n", sum, uncompressed.Name())
	// Rewind so that Compress reads the file from the beginning.
	if _, err := uncompressed.Seek(0, io.SeekStart); err != nil {
		log.Fatal(err)
	}
	compressed, err := Compress(uncompressed)
	if err != nil {
		log.Fatal(err)
	}
	sum, err = Sum256(compressed)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%x %s.gz\n", sum, uncompressed.Name())
}
gzip results:
$ sha256sum hello.txt
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 hello.txt
$ gzip -c hello.txt | sha256sum
809d7f11e97291d06189e82ca09a1a0a4a66a3c85a24ac7ff389ae6fbe02bcce -
$ gzip -nc hello.txt | sha256sum
f901eda57fd86d4239806fd4b76f64036c1c20711267a7bc776ab2aa45069b2a -
My program results:
$ go run main.go
# match
2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 hello.txt
# mismatch
3429ae8bc6346f1e4fb67b7d788f85f4637e726a725cf4b66c521903d0ab3b07 hello.txt.gz
Any idea why the outputs don't match, or how to fix this? I have tried using an io.Pipe, an ioutil.TempFile file, and other methods, all with the same issue.
In particular, note that if you run the command:
gzip -c hello.txt
The output will contain the filename, hello.txt. You can see this with hexdump:
$ touch hello.txt; gzip -c hello.txt | hexdump -C
00000000  1f 8b 08 08 ad 1b 14 5c  00 03 68 65 6c 6c 6f 2e  |.......\..hello.|
00000010  74 78 74 00 03 00 00 00  00 00 00 00 00 00        |txt...........|
0000001e
If you just copy data into a Gzip stream in your program, the filename won't be there. So the compressed bytes differ, and the SHA-256 sum differs with them.
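However, even if you fix this particular defect, you are still not guaranteed to get the same results by running Gzip on the same file. The header also records a modification time and an operating-system byte, and different DEFLATE implementations (or different compression levels) are free to produce different compressed bytes for the same input.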
If you want the checksum to be the same, run the checksum on the decompressed data instead.