Handling malformed HTML with Go's net/html tokenizer?

I've found that html.NewTokenizer() doesn't auto-fix malformed markup, so you can end up with a stray closing tag (html.EndTagToken). For example, <div></p></div> produces html.StartTagToken, html.EndTagToken, html.EndTagToken.

Is there a recommended way to ignore, remove, or fix these tags?

My first guess would be to manually keep a []atom.Atom slice as a stack, pushing on each start tag and popping on each end tag (after comparing the tag to make sure it isn't an unexpected end tag).
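The stack idea above could look roughly like the following. This is only a minimal sketch, not a recommended or official approach: to keep it self-contained it uses hypothetical stand-in types (`tokenKind`, `token`) instead of html.Token, and the helper names `balance` and `render` are made up for illustration. With the real tokenizer, the same logic would switch on token.Type and compare token.DataAtom (or token.Data) instead.

```go
package main

import "fmt"

// tokenKind is a hypothetical stand-in for html.TokenType.
type tokenKind int

const (
	startTag tokenKind = iota // plays the role of html.StartTagToken
	endTag                    // plays the role of html.EndTagToken
)

// token is a hypothetical stand-in for html.Token (type and tag name only).
type token struct {
	kind tokenKind
	name string
}

// balance drops end tags that match no open element, closes elements that
// were left open inside a matched element, and closes whatever is still
// open at the end of the input.
func balance(in []token) []token {
	var stack []string // names of currently open elements
	var out []token
	for _, t := range in {
		switch t.kind {
		case startTag:
			stack = append(stack, t.name)
			out = append(out, t)
		case endTag:
			// Find the nearest open element with the same name.
			idx := -1
			for i := len(stack) - 1; i >= 0; i-- {
				if stack[i] == t.name {
					idx = i
					break
				}
			}
			if idx < 0 {
				continue // stray end tag: nothing to close, so drop it
			}
			// Close elements opened after the match, innermost first.
			for i := len(stack) - 1; i > idx; i-- {
				out = append(out, token{endTag, stack[i]})
			}
			stack = stack[:idx]
			out = append(out, t)
		}
	}
	// Close anything still open at end of input.
	for i := len(stack) - 1; i >= 0; i-- {
		out = append(out, token{endTag, stack[i]})
	}
	return out
}

// render turns the tokens back into markup for inspection.
func render(tokens []token) string {
	s := ""
	for _, t := range tokens {
		if t.kind == endTag {
			s += "</" + t.name + ">"
		} else {
			s += "<" + t.name + ">"
		}
	}
	return s
}

func main() {
	// Token sequence for <div><div><p></p></p></div>.
	in := []token{
		{startTag, "div"}, {startTag, "div"}, {startTag, "p"},
		{endTag, "p"}, {endTag, "p"}, {endTag, "div"},
	}
	fmt.Println(render(balance(in))) // prints <div><div><p></p></div></div>
}
```

The sketch treats every start tag as opening an element, so void elements such as <br> or <img> would have to be skipped before pushing. It also ignores the implied-end-tag rules that a full HTML parser applies.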

Here is some code to demonstrate the problem:

var err error
htm := `<div><div><p></p></p></div>`

tokenizer := html.NewTokenizer(strings.NewReader(htm))

for {

    if tokenizer.Next() == html.ErrorToken {
        err = tokenizer.Err()
        if err == io.EOF {
            err = nil
        }

        return
    }

    token := tokenizer.Token()

    switch token.Type {
    case html.DoctypeToken:
        continue
    case html.CommentToken:
        continue
    case html.SelfClosingTagToken:
        fmt.Println(token.Data)
        continue
    case html.StartTagToken:
        fmt.Printf("<%s>\n", token.Data)

    case html.EndTagToken:
        fmt.Printf("</%s>\n", token.Data)

    case html.TextToken:
        continue
    default:
        continue
    }
}

Output:

<div>
<div>
<p>
</p>
</p>
</div>

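FWIW, it seems that net/html can fix such issues when you use its html.Parse function instead of the tokenizer. Parse builds a full document tree, so the rendered result is wrapped in <html>, <head>, and <body> elements. Here's an example adapted from another SO answer, using your malformed HTML snippet: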

package main

import (
    "bytes"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    brokenHtml := `<div><div><p></p></p></div>`

    reader := strings.NewReader(brokenHtml)
    root, err := html.Parse(reader)

    if err != nil {
        log.Fatal(err)
    }

    var b bytes.Buffer
    // Render can fail on write errors, so check its result.
    if err := html.Render(&b, root); err != nil {
        log.Fatal(err)
    }
    fixedHtml := b.String()

    fmt.Println(fixedHtml)
}