Performance suggestions regarding trees and GC

In my attempt to create a CSS3 parser (https://github.com/tdewolff/css) for my minification library, I'm in the process of optimizing the tokenizer, parser and minifier. The tokenizer is pretty fast, but I believe the parser could be optimized.

The parser processes the tokens and generates a tree of nodes representing the syntax. All nodes satisfy this interface:

type Node interface {
    Type() NodeType
    String() string
}

Example nodes (see https://github.com/tdewolff/css/blob/master/node.go):

// root
type NodeStylesheet struct {
    NodeType
    Nodes []Node
}

// example node
type NodeRuleset struct {
    NodeType
    SelGroups []*NodeSelectorGroup
    Decls     []*NodeDeclaration
}

// leaf (the only terminal leaf possible)
type NodeToken struct {
    NodeType
    TokenType
    Data string
}

Parsing stylesheet (see https://github.com/tdewolff/css/blob/master/parse.go):

func (p *parser) parseStylesheet() *NodeStylesheet {
    n := NewStylesheet()
    for {
        p.skipWhitespace()
        if p.at(ErrorToken) {
            return n
        }
        if p.at(CDOToken) || p.at(CDCToken) {
            n.Nodes = append(n.Nodes, p.shift())
        } else if cn := p.parseAtRule(); cn != nil {
            n.Nodes = append(n.Nodes, cn)
        } else if cn := p.parseRuleset(); cn != nil {
            n.Nodes = append(n.Nodes, cn)
        } else if cn := p.parseDeclaration(); cn != nil {
            n.Nodes = append(n.Nodes, cn)
        } else if !p.at(ErrorToken) {
            n.Nodes = append(n.Nodes, p.shift())
        }
    }
}

Each node is allocated on the heap, and a significant amount of time is spent in the GC and related tasks. Could I reduce that?

Could I put all elements in a flat array, for instance? The elements of the tree are filled sequentially, i.e. it can be flattened. What techniques can I use to reduce the many small allocations on the heap?

Update

The minifier is actually not really slow: Bootstrap (134 kB) is minified in 28ms, whereas NodeJS implementations take at least 45ms and produce larger files (http://goalsmashers.github.io/css-minification-benchmark/). But it would be great if I could squeeze out even more!

I know that some time is spent on []byte -> string conversion, but the []byte from the tokenizer needs to be copied anyway because its backing memory can be reused on any tokenizer.Next() call. Since a copy is needed regardless, I figured converting to string was better because it made much of the code simpler (e.g. checking for equality).

I could make a tokenizer variant that keeps the whole file in memory, which is fine for the parser because the parser doesn't stream anyway.

Update 2

I loaded the whole file into memory for the parser and replaced all strings with []byte; it's now 10% faster! Bootstrap.css now takes 23ms.

I don't think it's worth the hassle to flatten the tree, and doing so would remove some flexibility (the user's ability to modify the tree).