How to decompress (inflate) a PDF stream

I'm working with the 2016-W4 pdf, which has 2 large streams (pages 1 & 2), along with a bunch of other objects and smaller streams. I'm trying to inflate (decompress) the stream(s) to work with the source data, but am struggling. All I get are "corrupt input" and "invalid checksum" errors.

I've written a test script to help debug, and have pulled out smaller streams from the file to test with.

Here are 2 streams from the original pdf, along with their length objects:

stream 1:

149 0 obj
<< /Length 150 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType
1 /BBox [0 0 8 8] /Resources 151 0 R >>
stream
x+TT(T0B ,JUWÈS0Ð37±402V(NFJSþ¶
«
endstream
endobj
150 0 obj
42
endobj

stream 2

142 0 obj
<< /Length 143 0 R /Filter /FlateDecode /Type /XObject /Subtype /Form /FormType
1 /BBox [0 0 0 0] /Resources 144 0 R >>
stream
x+Tçã
endstream
endobj
143 0 obj
11
endobj

I copied just the stream contents into new files within Vim (excluding the carriage returns after stream and before endstream).

I've tried both:

  • compress/flate (rfc-1951) – (removing the first 2 bytes (CMF, FLG))
  • compress/zlib (rfc-1950)

I've converted the streams to []byte for the below:

package main

import (
    "bytes"
    "compress/flate"
    "compress/zlib"
    "fmt"
    "io"
    "os"
)

var (
    flateReaderFn = func(r io.Reader) (io.ReadCloser, error) { return flate.NewReader(r), nil }
    zlibReaderFn  = func(r io.Reader) (io.ReadCloser, error) { return zlib.NewReader(r) }
)

func deflate(b []byte, skip, length int, newReader func(io.Reader) (io.ReadCloser, error)) {
    // rfc-1950
    // --------
    //   First 2 bytes
    //   [120, 1] - CMF, FLG
    //
    //   CMF: 120
    //     0111 1000
    //     ↑    ↑
    //     |    CM(8) = deflate compression method
    //     CINFO(7)   = 32k LZ77 window size
    //
    //   FLG: 1
    //     0001 ← FCHECK
    //            (CMF*256 + FLG) % 31 == 0
    //             120 * 256 + 1 = 30721
    //                             30721 % 31 == 0

    stream := bytes.NewReader(b[skip:length])
    r, err := newReader(stream)
    if err != nil {
        fmt.Println("\nfailed to create reader,", err)
        return
    }

    n, err := io.Copy(os.Stdout, r)
    if err != nil {
        if n > 0 {
            fmt.Print("\n")
        }
        fmt.Println("\nfailed to write contents from reader,", err)
        return
    }
    fmt.Printf("%d bytes written\n", n)
    r.Close()
}

func main() {
    //readerFn, skip := flateReaderFn, 2 // compress/flate RFC-1951, ignore first 2 bytes
    readerFn, skip := zlibReaderFn, 0 // compress/zlib RFC-1950, ignore nothing

    //                                                                                                ⤹ This is where the error occurs: `flate: corrupt input before offset 19`.
    stream1 := []byte{120, 1, 43, 84, 8, 84, 40, 84, 48, 0, 66, 11, 32, 44, 74, 85, 8, 87, 195, 136, 83, 48, 195, 144, 51, 55, 194, 177, 52, 48, 50, 86, 40, 78, 70, 194, 150, 74, 83, 8, 4, 0, 195, 190, 194, 182, 10, 194, 171, 10}
    stream2 := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}

    fmt.Println("----------------------------------------\nStream 1:")
    deflate(stream1, skip, 42, readerFn) // flate: corrupt input before offset 19

    fmt.Println("----------------------------------------\nStream 2:")
    deflate(stream2, skip, 11, readerFn) // invalid checksum
}

I'm sure I'm doing something wrong somewhere, I just can't quite see it.

(The pdf does open in a viewer)

Never copy binary data out of a text editor, or save it from one. It sometimes happens to work, and that only adds fuel to the fire.

The data you eventually "mined out" of the PDF is most likely not identical to the data that is actually in the PDF. Take the data from a hex editor (e.g. try hecate for something new), or write a simple app that saves it and handles the file strictly as binary.

Hint #1:

The binary data is displayed spread across multiple lines. Binary data does not contain carriage returns; those are a textual control. If line breaks show up, the editor interpreted the data as text, so some codes / characters were "consumed" to start a new line. Multiple sequences may be interpreted as the same newline (e.g. \r\n, \n). By excluding them you have already lost data; by including them you may already have a different sequence. And if the data was interpreted and displayed as text, more problems may arise: there are more control characters, and some characters may not appear at all when displayed.
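As a rough illustration of this kind of distortion, the sketch below assumes the editor read each byte as a Latin-1 character and wrote the buffer back out as UTF-8. That assumption is not confirmed anywhere in the thread, but it is consistent with the 195, 167 and 195, 163 pairs in the question's stream2 slice. The helper name asUTF8 is made up for this example.

```go
package main

import "fmt"

// asUTF8 mimics an editor that treats every input byte as one Latin-1
// character and then saves the text as UTF-8 (an assumption, see above).
func asUTF8(raw []byte) []byte {
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	// Converting []rune to string encodes each code point as UTF-8.
	return []byte(string(runes))
}

func main() {
	// Four bytes, two of them above 0x7F.
	raw := []byte{0x01, 0xE7, 0x00, 0xE3}
	text := asUTF8(raw)

	fmt.Println(len(raw), raw)   // 4 [1 231 0 227]
	fmt.Println(len(text), text) // 6 [1 195 167 0 195 163]
}
```

Every byte above 0x7F turns into two bytes, so both the content and the length of the data change even before any newline handling gets involved.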

Hint #2:

When flateReaderFn is used, decoding the 2nd example succeeds (completes without an error). This means "you were barking up the right tree", but the success depends on what the actual data is and to what extent it was "distorted" by the text editor.
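This can be checked against stream 2 as posted. The sketch below is not a general repair: it assumes the only damage is that bytes above 0x7F were saved as UTF-8 (an assumption based on the 195, 167 and 195, 163 pairs in the slice), and it will not recover bytes that were dropped or altered as newlines. The helpers undoUTF8 and inflateZlib are hypothetical names for this example.

```go
package main

import (
	"bytes"
	"compress/flate"
	"compress/zlib"
	"fmt"
	"io"
	"unicode/utf8"
)

// undoUTF8 maps each UTF-8 encoded code point below 256 back to one byte.
// It reports false for invalid UTF-8 or code points that do not fit a byte.
func undoUTF8(text []byte) ([]byte, bool) {
	out := make([]byte, 0, len(text))
	for len(text) > 0 {
		r, size := utf8.DecodeRune(text)
		if r > 0xFF { // also catches utf8.RuneError (U+FFFD)
			return nil, false
		}
		out = append(out, byte(r))
		text = text[size:]
	}
	return out, true
}

// inflateZlib decodes an RFC 1950 stream, including the Adler-32 check.
func inflateZlib(b []byte) (string, error) {
	r, err := zlib.NewReader(bytes.NewReader(b))
	if err != nil {
		return "", err
	}
	defer r.Close()
	out, err := io.ReadAll(r)
	return string(out), err
}

func main() {
	// Stream 2 exactly as posted in the question (bytes saved from Vim).
	garbled := []byte{120, 1, 43, 84, 8, 4, 0, 1, 195, 167, 0, 195, 163, 10}

	// Raw DEFLATE, skipping CMF/FLG: the trailing checksum is never read.
	raw, err := io.ReadAll(flate.NewReader(bytes.NewReader(garbled[2:])))
	fmt.Printf("flate, as posted: %q, err=%v\n", raw, err)

	// zlib also verifies the 4-byte Adler-32 trailer.
	_, err = inflateZlib(garbled)
	fmt.Println("zlib, as posted: ", err)

	if fixed, ok := undoUTF8(garbled); ok {
		out, err := inflateZlib(fixed)
		fmt.Printf("zlib, repaired:   %q, err=%v\n", out, err)
	}
}
```

For this particular stream the compressed payload is all ASCII, so only the checksum bytes were altered. That is why the raw flate reader succeeds while the zlib reader reports an invalid checksum.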

Okay, confession time...

I was so caught up in trying to understand deflate that I completely overlooked the fact that Vim wasn't saving the stream contents correctly into new files. So I spent quite a bit of time reading the RFCs, and digging through the internals of the Go compress/... packages, assuming the problem was with my code.

Shortly after I posted my question I tried reading the PDF as a whole, finding the stream/endstream locations, and pushing that through deflate. As soon as I saw the content scroll through the screen I realized my dumb mistake.
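A minimal sketch of that approach is below. It is not the code I actually ran. It assumes every stream uses /FlateDecode, it ignores the /Length objects, and it simply searches for the stream and endstream keywords, so it can be fooled if compressed data happens to contain the word "endstream". The function names extractStreams and inflate are made up for this example.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"fmt"
	"io"
	"os"
)

// extractStreams returns the raw bytes between each stream/endstream pair.
func extractStreams(pdf []byte) [][]byte {
	var out [][]byte
	for {
		i := bytes.Index(pdf, []byte("stream"))
		if i < 0 {
			return out
		}
		body := pdf[i+len("stream"):]
		// The keyword is followed by CRLF or LF, which is not stream data.
		body = bytes.TrimPrefix(body, []byte("\r"))
		body = bytes.TrimPrefix(body, []byte("\n"))
		j := bytes.Index(body, []byte("endstream"))
		if j < 0 {
			return out
		}
		out = append(out, body[:j])
		// Continue after "endstream" so its "stream" suffix is not matched.
		pdf = body[j+len("endstream"):]
	}
}

// inflate decodes one zlib (RFC 1950) stream. Bytes after the checksum,
// such as the end-of-line before endstream, are left unread.
func inflate(raw []byte) ([]byte, error) {
	r, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: inflate <file.pdf>")
		return
	}
	pdf, err := os.ReadFile(os.Args[1]) // read strictly as binary
	if err != nil {
		fmt.Println("failed to read file,", err)
		return
	}
	for i, raw := range extractStreams(pdf) {
		data, err := inflate(raw)
		if err != nil {
			fmt.Printf("stream %d: %d raw bytes, %v\n", i, len(raw), err)
			continue
		}
		fmt.Printf("stream %d: %d raw bytes, %d inflated\n", i, len(raw), len(data))
	}
}
```

Because the file is read with os.ReadFile and never passes through a text editor, the bytes handed to zlib are exactly the ones in the PDF.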

+1 @icza, that was exactly my issue.

It was good in the end, as I have a much better understanding of the whole process than if it had just worked the first time around.