I have the following code, which tries to save the raw 16 bytes of a UUID (which include 0x0A) in CSV format:
package main

import (
	"encoding/csv"
	"log"
	"os"

	"github.com/satori/go.uuid"
)

func main() {
	u, err := uuid.FromString("e1393c62-877a-4adc-8ffb-f1bf0a337c5f")
	if err != nil {
		log.Fatal(err)
	}
	csv_file, err := os.OpenFile("csv_wtf.csv", os.O_WRONLY|os.O_CREATE, 0644)
	if err != nil {
		log.Fatal(err)
	}
	s := string(u.Bytes())
	log.Printf("len(s)=%d", len(s))
	csv_writer := csv.NewWriter(csv_file)
	csv_writer.UseCRLF = false
	csv_writer.Write([]string{s})
	csv_writer.Flush()
	finfo, err := csv_file.Stat()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("size csv_wtf.csv = %d", finfo.Size())
	csv_file.Close()
}
This code writes the data to the CSV file with extra bytes added:
2017/04/16 12:37:14 len(s)=16
2017/04/16 12:37:14 size csv_wtf.csv = 29
Why does encoding/csv add extra bytes when it iterates over my string with range (see https://golang.org/src/encoding/csv/writer.go#L38, https://golang.org/src/encoding/csv/writer.go#L50 and https://golang.org/src/encoding/csv/writer.go#L76)?
Could somebody help me find a CSV package that doesn't do this strange conversion?
This is because the CSV format is not suitable for storing raw binary data, which is unlikely to be a valid UTF-8 sequence.
What happens is that when csv_writer.Write iterates over the string with a range loop, every time it encounters an invalid UTF-8 sequence the rune r1 is set to 65533 (U+FFFD, the Unicode replacement character), which is then written out as 3 bytes: 0xef, 0xbf, 0xbd.
Illustrative example:
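package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Three bytes, none of which is valid UTF-8 on its own.
	invalidString := string([]byte{0xff, 0xfe, 0xfd})
	var b bytes.Buffer
	// range decodes the string rune by rune; each invalid byte yields U+FFFD.
	for _, r := range invalidString {
		fmt.Printf("current rune: %v\n", r)
		b.WriteRune(r)
	}
	fmt.Printf("total data: %v\n", b.Bytes())
}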
The output is:
current rune: 65533
current rune: 65533
current rune: 65533
total data: [239 191 189 239 191 189 239 191 189]
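The same reasoning accounts for the 29 bytes reported in the question. The sketch below counts how many bytes the question's UUID occupies once every rune produced by range is re-encoded as UTF-8. The helper name expandedLen is made up for this illustration, and the code only imitates a rune-by-rune writer; it does not call encoding/csv.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// expandedLen (hypothetical helper) returns the number of bytes s occupies
// after each rune yielded by range is re-encoded as UTF-8. Invalid bytes
// decode to U+FFFD, which takes 3 bytes.
func expandedLen(s string) int {
	n := 0
	for _, r := range s {
		n += utf8.RuneLen(r)
	}
	return n
}

func main() {
	// Raw bytes of e1393c62-877a-4adc-8ffb-f1bf0a337c5f.
	raw := []byte{
		0xe1, 0x39, 0x3c, 0x62, 0x87, 0x7a, 0x4a, 0xdc,
		0x8f, 0xfb, 0xf1, 0xbf, 0x0a, 0x33, 0x7c, 0x5f,
	}
	s := string(raw)
	fmt.Println(len(s), expandedLen(s)) // prints "16 26"
}
```

Five of the 16 bytes (0xe1, 0x87, 0xfb, 0xf1, 0xbf) are invalid where they stand and grow from 1 byte to 3 each, giving 26 bytes. The field contains 0x0A, so it has to be quoted, which adds 2 bytes, and the record terminator adds 1 more: 26 + 2 + 1 = 29.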
So you should either abandon CSV in favour of some other format (suitable for storing binary data), or store UUIDs in their string form.