I'm converting a Go program that decodes email messages. It currently runs iconv to do the actual decoding, which of course has overhead. I would like to use the golang.org/x/text/transform and golang.org/x/net/html/charset packages to do this. Here is working code:
// cs is the charset that the email body is encoded with, pulled from
// the Content-Type declaration.
enc, name := charset.Lookup(cs)
if enc == nil {
	log.Fatalf("Can't find %s", cs)
}
// body is the email body we're converting to utf-8
r := transform.NewReader(strings.NewReader(body), enc.NewDecoder())
// result contains the converted-to-utf8 email body
result, err := ioutil.ReadAll(r)
That works great except when it encounters illegal bytes, which unfortunately is not uncommon when dealing with email in the wild. ioutil.ReadAll() returns an error along with all the converted bytes up to the problem. Is there a way to tell the transform package to ignore illegal bytes? Right now we use the -c flag to iconv to do that. I've gone through the docs for the transform package, and I can't tell whether it's possible or not.
UPDATE: Here's a test program that shows the problem (the Go playground doesn't have the charset or transform packages...). The raw text was taken from an actual email. Yep, it's in English, and yep, the charset in the email was set to EUC-KR. I need it to ignore that apostrophe.
package main

import (
	"io/ioutil"
	"log"
	"strings"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/transform"
)

func main() {
	raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
	enc, _ := charset.Lookup("euc-kr")
	r := transform.NewReader(strings.NewReader(raw), enc.NewDecoder())
	result, err := ioutil.ReadAll(r)
	if err != nil {
		log.Printf("ReadAll returned %s", err)
	}
	log.Printf("RESULT: '%s'", string(result))
}
Here is the solution I went with. Instead of using a Reader, I allocate the destination buffer by hand and call the Transform() function directly. When Transform() errors out, I check for a short destination buffer and reallocate if necessary. Otherwise I skip a rune, assuming that it is the illegal character. For completeness, I should also check for a short input buffer, but I do not do so in this example.
raw := `So, at 64 kBps, or kilobits per second, you’re getting 8 kilobytes a second.`
enc, _ := charset.Lookup("euc-kr")
dst := make([]byte, len(raw))
d := enc.NewDecoder()
var (
	in  int
	out int
)
for in < len(raw) {
	// Do the transformation
	ndst, nsrc, err := d.Transform(dst[out:], []byte(raw[in:]), true)
	in += nsrc
	out += ndst
	if err == nil {
		// Completed transformation
		break
	}
	if err == transform.ErrShortDst {
		// Our output buffer is too small, so we need to grow it
		log.Printf("Short")
		t := make([]byte, (cap(dst)+1)*2)
		copy(t, dst)
		dst = t
		continue
	}
	// We're here because of at least one illegal character. Skip over
	// the current rune and try again. (This needs the unicode/utf8
	// package imported as well.)
	_, width := utf8.DecodeRuneInString(raw[in:])
	in += width
}
log.Printf("RESULT: '%s'", dst[:out])
enc.NewDecoder() returns an encoding.Decoder, which embeds a transform.Transformer. The doc of NewDecoder() says:
Transforming source bytes that are not of that encoding will not result in an error per se. Each byte that cannot be transcoded will be represented in the output by the UTF-8 encoding of '\uFFFD', the replacement rune.
This tells us that illegal bytes do not make the transform fail outright; they show up in the output as the replacement rune (also known as the error rune). Fortunately it is easy to strip those out.
golang.org/x/text/transform provides two helper functions we can use to solve this problem: Chain() takes a set of transformers and chains them together, and RemoveFunc() takes a function and removes all runes for which it returns true. Something like the following (untested) should work:
filter := transform.Chain(enc.NewDecoder(), transform.RemoveFunc(func(r rune) bool {
	return r == utf8.RuneError
}))
r := transform.NewReader(strings.NewReader(body), filter)
That should filter out all rune errors before they reach the reader and blow up. (utf8.RuneError comes from the unicode/utf8 package.)