I'm trying to parse emails and I get errors like these from the mail package. Is this a bug in the mail package, or something I should handle myself?
missing word in phrase: charset not supported: "gb18030"
charset not supported: "koi8-r"
missing word in phrase: charset not supported: "ks_c_5601-1987"
How can I fix them? I think I should use a charset package, but I'm not sure how. Here is what one of the email headers looks like:
Received: from smtpbg303.qq.com ([184.105.206.26]) by mx-ha.gmx.net
(mxgmxus001) with ESMTPS (Nemesis) id 0MAOx2-1X2yNC2ZFC-00BaVU for
<sormester@lobbyist.com>; Sat, 14 Jun 2014 18:11:48 +0200
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qq.com; s=s201307;
t=1402762305; bh=imEvSr8IPsqWTXU63xUHRv+wuQG+Tcz2mPP9ai4rrE4=;
h=X-QQ-FEAT:X-QQ-SSF:X-HAS-ATTACH:X-QQ-BUSINESS-ORIGIN:
X-Originating-IP:In-Reply-To:References:X-QQ-STYLE:X-QQ-mid:From:To:Subject:Mime-Version:Content-Type:Content-Transfer-Encoding:Date:
X-Priority:Message-ID:X-QQ-MIME:X-Mailer:X-QQ-Mailer:
X-QQ-ReplyHash:X-QQ-SENDSIZE:X-QQ-FName:X-QQ-LocalIP;
b=QXs4CveboS8nG6htN9W6amC3X+F7X3ZtFrt6jrjWI+RmbvqBuTCVmX9IlaqCX84H8
n14x2Wp7x4kDYcNRqhe+HjTpf715TTQXc4d40b9e38frC/5qIhpMtYNsD8iEJwRzHW
U3xi8Yq7OCIB303fIpytx8tOjexQpZKSHbJ7ecX0=
X-QQ-FEAT: zaIfg0hwV2pIDflZYPQUsuPPXG5wtRVHJU6PiOYLBBA=
X-QQ-SSF: 00010000000000F000000000000000L
X-HAS-ATTACH: no
X-QQ-BUSINESS-ORIGIN: 2
X-Originating-IP: 180.155.99.102
In-Reply-To: <trinity-b7c6d611-52fd-4afa-b739-2deb243532a6-1402761364579@3capp-mailcom-lxa05>
References: <97e07dab7c2d1a005ed928c4350690e0@hotels-desk.co.uk>,
<tencent_105D3DC11702F53465C0025D@qq.com>
<trinity-b7c6d611-52fd-4afa-b739-2deb243532a6-1402761364579@3capp-mailcom-lxa05>
X-QQ-STYLE:
X-QQ-mid: webmail474t1402762303t356131
From: "=?gb18030?B?08bTzg==?=" <38438nx@qq.com>
To: "=?gb18030?B?V2lsaGVsbSBLdW1tZXI=?=" <sormester@lobbyist.com>
Subject: =?gb18030?B?u9i4tKO6ILvYuLSjulBhbGFjZSBXZXN0bWluc3Rl?=
=?gb18030?B?cjogMDEtMDctMjAxNCAtIDA0LTA3LTIwMTQ=?=
Mime-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_539C743F_08A07490_0157E268"
Content-Transfer-Encoding: 8Bit
Date: Sun, 15 Jun 2014 00:11:43 +0800
X-Priority: 3
Message-ID: <tencent_573A737E73016B9F5A3D10C1@qq.com>
X-QQ-MIME: TCMime 1.0 by Tencent
X-Mailer: QQMail 2.x
X-QQ-Mailer: QQMail 2.x
X-QQ-ReplyHash: 170675637
X-QQ-SENDSIZE: 520
X-QQ-FName: 7B2EFFAD16B8462B84D3499A4CC7DDEF
X-QQ-LocalIP: 163.177.66.155
Envelope-To: <sormester@lobbyist.com>
X-GMX-Antispam: 0 (Mail was not recognized as spam); Detail=V3;
X-GMX-Antivirus: 0 (no virus found)
Edit:
I've tried using the charset package, but it has no effect. I still get the same errors on the same messages.
import "code.google.com/p/go-imap/go1/imap"

header := imap.AsBytes(rsp.MessageInfo().Attrs["RFC822.HEADER"])
r, err := charset.NewReader("UTF-8", bytes.NewReader(header))
if err != nil {
	log.Fatal(err)
}
fmt.Printf("new char is %v", r)
msg, err := mail.ReadMessage(r)
if err != nil {
	// log.Fatal exits the program, so no return is needed here.
	log.Fatal(err)
}
mg.From, err = msg.Header.AddressList("From")
if err != nil {
	log.Errorf("NO FROM msg %s, err %v", header, err)
	return
}
The mail package seems to be able to decode only RFC 2047 encoded words, but the charset package doesn't support this:
character set "rfc2047" not found
Could mahonia fix the issue?
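For context, the failing values are RFC 2047 encoded words of the form =?charset?encoding?payload?=, so "rfc2047" names the wrapper rather than a character set, and the charset that needs converting is the one inside the word (gb18030 here). The following is a minimal sketch of that structure, not how the mail package does it; the type encodedWord and the function splitEncodedWord are hypothetical names, and only the "B" (base64) encoding is handled.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"strings"
)

// encodedWord holds the three parts of an RFC 2047 encoded word.
// (Hypothetical type, for illustration only.)
type encodedWord struct {
	Charset  string // e.g. "gb18030"
	Encoding string // "B" for base64; "Q" is not handled in this sketch
	Payload  []byte // raw bytes, still encoded in Charset
}

// splitEncodedWord splits "=?charset?B?payload?=" into its parts and
// base64-decodes the payload. It does not convert the payload to UTF-8.
func splitEncodedWord(word string) (encodedWord, error) {
	if len(word) < 4 || !strings.HasPrefix(word, "=?") || !strings.HasSuffix(word, "?=") {
		return encodedWord{}, fmt.Errorf("not an encoded word: %q", word)
	}
	parts := strings.Split(word[2:len(word)-2], "?")
	if len(parts) != 3 {
		return encodedWord{}, fmt.Errorf("malformed encoded word: %q", word)
	}
	if !strings.EqualFold(parts[1], "B") {
		return encodedWord{}, fmt.Errorf("encoding %q not handled in this sketch", parts[1])
	}
	raw, err := base64.StdEncoding.DecodeString(parts[2])
	if err != nil {
		return encodedWord{}, err
	}
	return encodedWord{Charset: parts[0], Encoding: parts[1], Payload: raw}, nil
}

func main() {
	// The display name from the To header above.
	w, err := splitEncodedWord("=?gb18030?B?V2lsaGVsbSBLdW1tZXI=?=")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("charset=%s encoding=%s payload=%q\n", w.Charset, w.Encoding, w.Payload)
}
```

The payload bytes still have to be converted from the named charset to UTF-8, which is the step the "charset not supported" errors are about.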
I hope this helps anyone considering Go for processing email (e.g. developing client apps). It seems the Go standard library is not mature enough for email processing: it doesn't handle multipart messages, different character sets, and so on. After almost a day of trying different hacks and packages, I decided to throw the Go code away and use a good old JavaMail solution.
Alexey Vasiliev's MIT-licensed http://github.com/le0pard/go-falcon/ includes a parser package that applies whichever encoding package is needed to decode the headers (the meat is in utils.go).
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"net/textproto"

	"github.com/le0pard/go-falcon/parser"
)

var msg = []byte(`Subject: =?gb18030?B?u9i4tKO6ILvYuLSjulBhbGFjZSBXZXN0bWluc3Rl?=
 =?gb18030?B?cjogMDEtMDctMjAxNCAtIDA0LTA3LTIwMTQ=?=

`)

func main() {
	tpr := textproto.NewReader(bufio.NewReader(bytes.NewBuffer(msg)))
	mh, err := tpr.ReadMIMEHeader()
	if err != nil {
		panic(err)
	}
	for name, vals := range mh {
		for _, val := range vals {
			val = parser.MimeHeaderDecode(val)
			fmt.Print(name, ": ", val, "\n")
		}
	}
}
It looks like its parser.FixEncodingAndCharsetOfPart is used by the package to decode/convert content as well, though with a couple of extra allocations caused by converting the []byte body to/from a string. If the API doesn't work for you, you might at least be able to use the code to see how it can be done.
Found via godoc.org's "...and is imported by 3 packages" link from encoding/simplifiedchinese -- hooray godoc.org!
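As a rough illustration of how the decoding step can be wired up without a third-party parser: newer Go releases (1.5 and later) provide mime.WordDecoder, whose CharsetReader field is called for any charset other than utf-8, iso-8859-1 and us-ascii. The sketch below assumes Go 1.16+ (for io.ReadAll). To stay standard-library only, placeholderCharsetReader is a hypothetical stand-in that accepts only ASCII payloads; a real program would return a proper converter there, for example one built from encoding/simplifiedchinese for gb18030.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"mime"
)

// asciiCompatible lists the charsets from the question (lowercase, as
// WordDecoder passes them), treated here as ASCII-compatible below 0x80.
var asciiCompatible = map[string]bool{
	"gb18030":        true,
	"koi8-r":         true,
	"ks_c_5601-1987": true,
}

// placeholderCharsetReader is a hypothetical stand-in for a real converter.
// It passes pure-ASCII payloads through unchanged and rejects anything else.
func placeholderCharsetReader(charset string, input io.Reader) (io.Reader, error) {
	if !asciiCompatible[charset] {
		return nil, fmt.Errorf("unhandled charset %q", charset)
	}
	data, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	for _, b := range data {
		if b >= 0x80 {
			return nil, fmt.Errorf("charset %q: byte 0x%02x needs a real decoder", charset, b)
		}
	}
	return bytes.NewReader(data), nil
}

// decodeHeader decodes every RFC 2047 encoded word in a header value.
func decodeHeader(value string) (string, error) {
	dec := &mime.WordDecoder{CharsetReader: placeholderCharsetReader}
	return dec.DecodeHeader(value)
}

func main() {
	// Display names from the To and From headers in the question.
	for _, v := range []string{
		"=?gb18030?B?V2lsaGVsbSBLdW1tZXI=?=", // ASCII-only payload
		"=?gb18030?B?08bTzg==?=",             // non-ASCII payload
	} {
		s, err := decodeHeader(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s)
	}
}
```

With the placeholder, the first value decodes and the second is reported as an error, which marks the spot where a real gb18030 converter belongs. net/mail's AddressParser accepts the same kind of decoder through its WordDecoder field.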
I've been using github.com/jhillyerd/enmime, which seems to have no trouble with this. It'll parse out both headers and body content. Given an io.Reader r:
// Parse the message (headers and body).
env, _ := enmime.ReadEnvelope(r)

// Headers can be retrieved via Envelope.GetHeader(name).
fmt.Printf("From: %v\n", env.GetHeader("From"))

// Address-type headers can be parsed into a list of decoded mail.Address structs.
alist, _ := env.AddressList("To")
for _, addr := range alist {
	fmt.Printf("To: %s <%s>\n", addr.Name, addr.Address)
}
fmt.Printf("Subject: %v\n", env.GetHeader("Subject"))

// The plain text body is available as env.Text.
fmt.Printf("Text Body: %v chars\n", len(env.Text))

// The HTML body is stored in env.HTML.
fmt.Printf("HTML Body: %v chars\n", len(env.HTML))

// env.Inlines is a slice of inlined attachments.
fmt.Printf("Inlines: %v\n", len(env.Inlines))

// env.Attachments contains the non-inline attachments.
fmt.Printf("Attachments: %v\n", len(env.Attachments))