Round-tripping XML through Unmarshal and MarshalIndent

I wanted to quickly create a utility to format arbitrary XML data using Go's xml.MarshalIndent().

However, this code

package main

import (
    "encoding/xml"
    "fmt"
)

func main() {

    type node struct {
        XMLName  xml.Name
        Attrs    []xml.Attr `xml:",attr"`
        Text     string     `xml:",chardata"`
        Children []node     `xml:",any"`
    }

    x := node{}
    _ = xml.Unmarshal([]byte(doc), &x)
    buf, _ := xml.MarshalIndent(x, "", "  ") // prefix, indent

    fmt.Println(string(buf))
}

const doc string = `<book lang="en">
     <title>The old man and the sea</title>
       <author>Hemingway</author>
</book>`

produces

<book>&#xA;     &#xA;       &#xA;
  <title>The old man and the sea</title>
  <author>Hemingway</author>
</book>

Notice the extraneous matter after the <book> opening element.

  • I've lost my attributes - why?
  • I'd like to avoid gathering spurious inter-element chardata - How?

For starters, the attribute struct tag is wrong. A []xml.Attr field that should collect every attribute needs the tag xml:",any,attr" rather than xml:",attr", so that part is a simple fix.

From https://godoc.org/encoding/xml#Unmarshal

  • If the XML element has an attribute not handled by the previous rule and the struct has a field with an associated tag containing ",any,attr", Unmarshal records the attribute value in the first such field.

Second, a field tagged xml:",chardata" is not passed through the UnmarshalXML method of the xml.Unmarshaler interface, so you can't simply create a new type for Text and implement that interface on it. The same docs describe the rule for character data and list only []byte and string as field types for it:

  • If the XML element contains character data, that data is accumulated in the first struct field that has tag ",chardata". The struct field may have type []byte or string. If there is no such field, the character data is discarded.

Thus, the easiest way to deal with the unwanted characters is to strip them from the Text fields after unmarshaling.

Full code example here: https://play.golang.org/p/VSDskgfcLng

var Replacer = strings.NewReplacer("&#xA;", "", "&#x9;", "", "\n", "", "\t", "")

func recursiveReplace(n *Node) {
    n.Text = Replacer.Replace(n.Text)
    for i := range n.Children {
        recursiveReplace(&n.Children[i])
    }
}

One could in principle implement the xml.Unmarshaler interface for Node, but then you have to deal not only with manual XML parsing but also with the fact that Node is a recursive structure. It's easiest to just remove the unwanted characters after the fact.