I'm using goyaml as a YAML beautifier. By loading and dumping a YAML file, I can source-format it. I unmarshal the data from a YAML source file into a struct, marshal that struct back to bytes, and write the bytes to an output file. But the process turns my Unicode strings into escaped, quoted string literals, and I don't know how to reverse it.
Example input subtitle.yaml:
line: 你好
I've stripped everything down to the smallest reproducible problem. Here's the code, using _ to discard the returned errors (none occur here):
package main

import (
	"io/ioutil"
	//"unicode/utf8"
	//"fmt"

	// The package at this path is named yaml, so alias it to goyaml.
	goyaml "gopkg.in/yaml.v1"
)

type Subtitle struct {
	Line string
}

func main() {
	filename := "subtitle.yaml"
	in, _ := ioutil.ReadFile(filename)
	var subtitle Subtitle
	_ = goyaml.Unmarshal(in, &subtitle)
	out, _ := goyaml.Marshal(&subtitle)
	//for len(out) > 0 { // For debugging, see what the runes are
	//	r, size := utf8.DecodeRune(out)
	//	fmt.Printf("%c ", r)
	//	out = out[size:]
	//}
	_ = ioutil.WriteFile(filename, out, 0644)
}
Actual output subtitle.yaml:
line: "\u4F60\u597D"
I want to reverse the weirdness in goyaml after I get the variable out.
The commented-out rune-printing block, which adds spaces between runes for clarity, outputs the following. It shows that out no longer contains runes like 你, only their literal escape sequences:
l i n e : " \ u 4 F 6 0 \ u 5 9 7 D "
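To make the difference concrete, here is a small sketch that counts the runes in the escaped form and in the form I actually want. The helper name runes is made up for this illustration, and the two byte slices are typed in by hand rather than produced by goyaml:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runes decodes b as UTF-8 and returns the runes it contains.
// (Hypothetical helper, only for this illustration.)
func runes(b []byte) []rune {
	var rs []rune
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		rs = append(rs, r)
		b = b[size:]
	}
	return rs
}

func main() {
	// Raw string literal: the backslashes are kept as-is, which is
	// what the marshalled output looks like.
	escaped := []byte(`"\u4F60\u597D"`)
	// Interpreted string literal: Go turns each \u escape into one rune.
	wanted := []byte("\"\u4F60\u597D\"")

	fmt.Println(len(runes(escaped)), len(runes(wanted))) // prints "14 4"
}
```

So the decoding loop itself is fine; the escaped bytes are plain ASCII, and the two Chinese characters are simply not in out any more.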
How can I unquote out, before writing it to the output file, so that the output looks like the input (albeit beautified)?
Desired output subtitle.yaml:
line: "你好"
Temporary Solution
I've filed https://github.com/go-yaml/yaml/issues/11. In the meantime, @bobince's tip on yaml_emitter_set_unicode was helpful in uncovering the problem. It was defined as a C binding but never called (nor was there an option to set it)! I changed encode.go and added yaml_emitter_set_unicode(&e.emitter, true) to line 20, and everything works as expected. It would be better to make it optional, but that would require a change in the Marshal API.
I had a similar issue and used the following to work around the bug in goyaml.Marshal(). (*Regexp).ReplaceAllFunc is your friend: you can use it to expand the escaped Unicode runes in the byte slice. It is maybe a little too dirty for production, but it works for the example ;-)
package main

import (
	"io/ioutil"
	"regexp"
	"strconv"
	"unicode/utf8"

	"launchpad.net/goyaml"
)

type Subtitle struct {
	Line string
}

// reFind matches a whole `key: "value"` line whose quoted value contains
// at least one \u escape. (?m) makes ^ and $ match at every line, so this
// also works when the output has more than one line.
var reFind = regexp.MustCompile(`(?m)^\s*[^\s\:]+\:\s*".*\\u.*"\s*$`)

// reFindU matches a single \uXXXX escape.
var reFindU = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

func expandUnicodeInYamlLine(line []byte) []byte {
	// TODO: restrict this to the quoted string value
	return reFindU.ReplaceAllFunc(line, expandUnicodeRune)
}

// expandUnicodeRune turns one \uXXXX escape into the UTF-8 bytes of that rune.
func expandUnicodeRune(esc []byte) []byte {
	// The regexp guarantees four hex digits, so the error is ignored.
	ri, _ := strconv.ParseInt(string(esc[2:]), 16, 32)
	r := rune(ri)
	repr := make([]byte, utf8.RuneLen(r))
	utf8.EncodeRune(repr, r)
	return repr
}

func main() {
	filename := "subtitle.yaml"
	filenameOut := "subtitleout.yaml"
	in, _ := ioutil.ReadFile(filename)
	var subtitle Subtitle
	_ = goyaml.Unmarshal(in, &subtitle)
	out, _ := goyaml.Marshal(&subtitle)
	out = reFind.ReplaceAllFunc(out, expandUnicodeInYamlLine)
	_ = ioutil.WriteFile(filenameOut, out, 0644)
}
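If you want to try the expansion step on its own, without goyaml or any files, here is a minimal sketch using only the standard library. The names escapeRE and expandEscapes are made up for this example, and the input is the escaped line from the question typed in by hand. It only handles four-digit \uXXXX escapes; it does not know about other YAML escapes, an escaped backslash in front of a u, or characters outside the Basic Multilingual Plane.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// escapeRE matches a single \uXXXX escape.
// (Hypothetical name, not part of goyaml.)
var escapeRE = regexp.MustCompile(`\\u[0-9a-fA-F]{4}`)

// expandEscapes replaces each \uXXXX sequence in b with the UTF-8
// encoding of that code point. An escape that fails to parse is
// left untouched.
func expandEscapes(b []byte) []byte {
	return escapeRE.ReplaceAllFunc(b, func(esc []byte) []byte {
		n, err := strconv.ParseUint(string(esc[2:]), 16, 32)
		if err != nil {
			return esc
		}
		return []byte(string(rune(n)))
	})
}

func main() {
	// Raw string literal, so the backslashes are literal, as in the
	// marshalled output shown in the question.
	in := []byte(`line: "\u4F60\u597D"` + "\n")
	fmt.Printf("%s", expandEscapes(in)) // prints: line: "你好"
}
```

Unlike the version above, this applies the replacement to the whole buffer instead of only to lines that look like a quoted value, so it is simpler but also more likely to touch something it should not.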