I wanted to quickly create a utility to format any XML data using Go's xml.MarshalIndent().
However, this code:
package main

import (
	"encoding/xml"
	"fmt"
)

func main() {
	type node struct {
		XMLName  xml.Name
		Attrs    []xml.Attr `xml:",attr"`
		Text     string     `xml:",chardata"`
		Children []node     `xml:",any"`
	}

	x := node{}
	_ = xml.Unmarshal([]byte(doc), &x)
	buf, _ := xml.MarshalIndent(x, "", " ") // prefix, indent
	fmt.Println(string(buf))
}

const doc string = `<book lang="en">
<title>The old man and the sea</title>
<author>Hemingway</author>
</book>`
Produces:
<book>&#xA; &#xA; &#xA;
 <title>The old man and the sea</title>
 <author>Hemingway</author>
</book>
Notice the extraneous matter after the <book> opening element.
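For starters, you aren't using the attribute struct tag correctly: a []xml.Attr field that should collect all attributes needs the tag xml:",any,attr" rather than xml:",attr". That part is a simple fix.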
From https://godoc.org/encoding/xml#Unmarshal
- If the XML element has an attribute not handled by the previous rule and the struct has a field with an associated tag containing ",any,attr", Unmarshal records the attribute value in the first such field.
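To make the rule above concrete, here is a minimal sketch of the struct with the tag changed to ",any,attr". The type name Node and the helper attrsOf are only illustrative names, not something taken from the question.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Node mirrors the struct from the question, with the attribute tag corrected.
type Node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"` // receives attributes not claimed by another field
	Text     string     `xml:",chardata"`
	Children []Node     `xml:",any"`
}

// attrsOf is a hypothetical helper: it parses src and returns the
// attributes recorded on the root element.
func attrsOf(src string) ([]xml.Attr, error) {
	var n Node
	if err := xml.Unmarshal([]byte(src), &n); err != nil {
		return nil, err
	}
	return n.Attrs, nil
}

func main() {
	attrs, err := attrsOf(`<book lang="en"><title>The old man and the sea</title></book>`)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	for _, a := range attrs {
		fmt.Printf("%s=%q\n", a.Name.Local, a.Value) // lang="en"
	}
}
```

With the original ",attr" tag the slice stays empty, which is why lang="en" was missing from the output in the question.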
Second, because a field tagged xml:",chardata" is not passed through the UnmarshalXML method of the xml.Unmarshaler interface, you can't simply create a new type for Text and implement that interface for it. The same docs also restrict the field's type to []byte or string:
- If the XML element contains character data, that data is accumulated in the first struct field that has tag ",chardata". The struct field may have type []byte or string. If there is no such field, the character data is discarded.
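A small sketch of what that accumulation means in practice: the whitespace between child elements is character data too, so it all ends up in the parent's Text field. The helper name rootText and the two-space indentation of the input are assumptions made for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Node is the struct from the question with the attribute tag corrected.
type Node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Text     string     `xml:",chardata"`
	Children []Node     `xml:",any"`
}

// rootText is a hypothetical helper: it returns the character data that
// Unmarshal collected for the root element of src.
func rootText(src string) (string, error) {
	var n Node
	if err := xml.Unmarshal([]byte(src), &n); err != nil {
		return "", err
	}
	return n.Text, nil
}

func main() {
	src := "<book>\n  <title>The old man and the sea</title>\n  <author>Hemingway</author>\n</book>"
	text, err := rootText(src)
	if err != nil {
		fmt.Println("unmarshal failed:", err)
		return
	}
	// The three whitespace runs around the children are concatenated.
	fmt.Printf("%q\n", text) // "\n  \n  \n"
}
```

When that string is marshalled again, each newline is written as the character reference &#xA;, which is the extraneous matter after the opening element.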
Thus, the easiest way to deal with the unwanted characters is to strip them from the Text fields after unmarshalling.
Full code example here: https://play.golang.org/p/VSDskgfcLng
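// Replacer strips newlines and tabs, both raw and written as XML character references.
var Replacer = strings.NewReplacer("&#xA;", "", "&#x9;", "", "\n", "", "\t", "")

// recursiveReplace cleans the text of n and of all of its descendants.
func recursiveReplace(n *Node) {
	n.Text = Replacer.Replace(n.Text)
	for i := range n.Children {
		recursiveReplace(&n.Children[i])
	}
}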
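For anyone who wants to run this without opening the playground, the pieces can be put together roughly as follows. This is a sketch rather than a copy of the linked code: formatXML is a hypothetical helper name, the replacer only targets raw newlines and tabs, and the sample input is assumed to be indented with tabs.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Node is the struct from the question with the attribute tag corrected.
type Node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Text     string     `xml:",chardata"`
	Children []Node     `xml:",any"`
}

// replacer drops raw newlines and tabs from character data.
var replacer = strings.NewReplacer("\n", "", "\t", "")

// recursiveReplace cleans the text of n and of all of its descendants.
func recursiveReplace(n *Node) {
	n.Text = replacer.Replace(n.Text)
	for i := range n.Children {
		recursiveReplace(&n.Children[i])
	}
}

// formatXML (hypothetical name) unmarshals src, strips the layout
// whitespace and marshals the tree again with two-space indentation.
func formatXML(src string) (string, error) {
	var root Node
	if err := xml.Unmarshal([]byte(src), &root); err != nil {
		return "", err
	}
	recursiveReplace(&root)
	buf, err := xml.MarshalIndent(root, "", "  ")
	if err != nil {
		return "", err
	}
	return string(buf), nil
}

// doc is indented with tabs, which the replacer removes.
const doc = "<book lang=\"en\">\n\t<title>The old man and the sea</title>\n\t<author>Hemingway</author>\n</book>"

func main() {
	out, err := formatXML(doc)
	if err != nil {
		fmt.Println("format failed:", err)
		return
	}
	fmt.Println(out)
}
```

Two caveats with this approach: it also removes newlines and tabs that are part of real text content, and it leaves indentation made of spaces untouched. If the input is indented with spaces, swapping the replacer for strings.TrimSpace on each Text field is one possible alternative.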
One could theoretically implement the xml.Unmarshaler interface for Node, but then you have to deal not only with manual XML parsing but also with the fact that it is a recursive structure. It's easiest to just remove the unwanted characters after the fact.
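For comparison, this is roughly what the manual route could look like. It is only a sketch under a few assumptions: whitespace is trimmed from every text chunk with strings.TrimSpace, comments and processing instructions are dropped, and formatXML is a hypothetical helper name. Mixed content (text interleaved with child elements) is not handled faithfully.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// Node is the struct from the question with the attribute tag corrected.
type Node struct {
	XMLName  xml.Name
	Attrs    []xml.Attr `xml:",any,attr"`
	Text     string     `xml:",chardata"`
	Children []Node     `xml:",any"`
}

// UnmarshalXML walks the tokens of one element by hand and recurses into
// child elements through DecodeElement, which calls this method again.
func (n *Node) UnmarshalXML(d *xml.Decoder, start xml.StartElement) error {
	n.XMLName = start.Name
	n.Attrs = start.Copy().Attr
	for {
		tok, err := d.Token()
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			var child Node
			if err := d.DecodeElement(&child, &t); err != nil {
				return err
			}
			n.Children = append(n.Children, child)
		case xml.CharData:
			// Assumption: surrounding whitespace is layout, not content.
			n.Text += strings.TrimSpace(string(t))
		case xml.EndElement:
			return nil // end of the element this call started with
		}
	}
}

// formatXML (hypothetical name) re-indents src with two spaces per level.
func formatXML(src string) (string, error) {
	var root Node
	if err := xml.Unmarshal([]byte(src), &root); err != nil {
		return "", err
	}
	buf, err := xml.MarshalIndent(root, "", "  ")
	if err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	out, err := formatXML("<book lang=\"en\">\n    <title>The old man and the sea</title>\n    <author>Hemingway</author>\n</book>")
	if err != nil {
		fmt.Println("format failed:", err)
		return
	}
	fmt.Println(out)
}
```

Marshalling still goes through the struct tags, so only the decoding side is hand-written here. Whether that is worth it over the post-processing pass is a judgment call.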