Parsing list items from HTML with Go

I want to extract all list items (content of each <li></li>) with Go. Should I use regexp to get the <li> items or is there any other library for this?

My intention is to get a list or array in Go that contains all list items from a specific HTML web page. How should I do that?

You likely want to use the golang.org/x/net/html package. It's not in the Go standard packages, but instead in the Go Sub-repositories. (The sub-repositories are part of the Go Project but outside the main Go tree. They are developed under looser compatibility requirements than the Go core.)

There is an example in that documentation that may be similar to what you want.

If you need to stick with the Go standard packages for some reason, then for "typical HTML" you can use encoding/xml.

Both packages take an io.Reader for input. If you have a string or a []byte variable, you can wrap it with strings.NewReader or bytes.NewReader to get an io.Reader.

For HTML it's more likely you'll come from an http.Response body (make sure to close it when done). Perhaps something like:

    resp, err := http.Get(someURL)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    doc, err := html.Parse(resp.Body)
    if err != nil {
        return err
    }
    // Recursively visit nodes in the parse tree
    var f func(*html.Node)
    f = func(n *html.Node) {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, a := range n.Attr {
                if a.Key == "href" {
                    fmt.Println(a.Val)
                    break
                }
            }
        }
        for c := n.FirstChild; c != nil; c = c.NextSibling {
            f(c)
        }
    }
    f(doc)

Of course, parsing fetched web pages won't work for pages that modify their own contents with JavaScript on the client side.

Here's one way I found to solve this.

If you're trying to extract the text inside the li element, you first find the li element and then advance the tokenizer to the very next token, which will (hopefully) be the text. You may need some extra logic if the next token is instead an anchor, span, etc.

    resp, err := http.Get(url)
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    z := html.NewTokenizer(bufio.NewReader(resp.Body))
    for {
        tt := z.Next()
        switch tt {
        case html.ErrorToken:
            // End of document (or a real error); stop tokenizing.
            return
        case html.StartTagToken:
            t := z.Token()
            switch t.Data {
            case "li":
                // The very next token should be the text inside the <li>.
                z.Next()
                t = z.Token()
                fmt.Println(t.Data)
            }
        }
    }

But really, you should just use github.com/PuerkitoBio/goquery.