Getting the current position in the stream from the net/html tokenizer

Is there a way to get the current character position of a tag using the golang.org/x/net/html tokenizer?

Simplified code looks like:

func LookForForm(body string) {
    reader := strings.NewReader(body)
    tokenizer := html.NewTokenizer(reader)
    idx := 0
    lastIdx := 0
    for {
        token := tokenizer.Next()
        lastIdx = idx
        idx = int(reader.Size()) - int(reader.Len())
        switch token {
        case html.ErrorToken:
            return
        case html.StartTagToken:
            t := tokenizer.Token()
            tagName := strings.ToLower(t.Data)
            if tagName == "form" {
                fmt.Printf("found form at %d\n", lastIdx)
                return
            }
        }
    }
}

This doesn't work, I think because the tokenizer reads from the reader in chunks rather than character by character, so my Size - Len calculation is invalid. The tokenizer maintains two private span structs ( https://github.com/golang/net/blob/master/html/token.go line 147), but I don't know how to access them.
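To illustrate the chunking problem, here is a minimal standard-library sketch. It uses bufio.Reader purely as a stand-in for the tokenizer's internal buffering (an assumption for illustration; the tokenizer's buffering differs in detail), and the function name consumedAfterOneByte is made up for this example. After the consumer has handed out a single byte, the underlying strings.Reader has already been drained:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// consumedAfterOneByte reports how many bytes have been pulled out of the
// underlying strings.Reader once a buffering consumer has returned one byte.
// bufio.Reader is only a stand-in for the tokenizer here (assumption).
func consumedAfterOneByte(body string) int {
	src := strings.NewReader(body)
	buffered := bufio.NewReader(src)
	if _, err := buffered.ReadByte(); err != nil {
		return -1
	}
	// Same arithmetic as in the question: Size minus Len.
	return int(src.Size()) - src.Len()
}

func main() {
	body := "<div><form></form></div>"
	// For a short body, the whole string is consumed by the first fill.
	fmt.Println(consumedAfterOneByte(body), len(body))
}
```

For any body shorter than the consumer's buffer, the reported position equals the full length, which is why Size - Len says nothing about where the current token is.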

One possible solution that just occurred to me is to make a "reader" that only reads a single character at a time so my Size and Len calculations are always correct. But, that seems like a hack and any suggestions would be appreciated.

A non-buffering reader ended up working ok for me. The implementation of the reader looks something like:

package rule

import "io"

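// Reader serves a string one byte per Read call, so Pos never runs ahead
// of what the caller has actually read.
type Reader struct {
    s        string
    i        int64 // current reading index
    z        int64 // length of s
    prevRune int64 // index of the previously read rune or -1
}

// NewReader returns a Reader over s.
func NewReader(s string) *Reader {
    return &Reader{s: s, z: int64(len(s)), prevRune: -1}
}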

func (r *Reader) String() string {
    return r.s
}

func (r *Reader) Len() int {
    if r.i >= r.z {
        return 0
    }
    return int(r.z - r.i)
}


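// Size returns the length of the underlying string in bytes.
func (r *Reader) Size() int64 {
    return r.z
}

// Pos returns the number of bytes handed out so far, i.e. the current offset.
func (r *Reader) Pos() int64 {
    return r.i
}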


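// Read copies at most one byte into b, regardless of how large b is.
func (r *Reader) Read(b []byte) (int, error) {
    if len(b) == 0 {
        return 0, nil // avoid indexing into an empty slice
    }
    if r.i >= r.z {
        return 0, io.EOF
    }
    r.prevRune = -1
    b[0] = r.s[r.i]
    r.i++
    return 1, nil
}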
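If you would rather not maintain a custom type, the standard library's testing/iotest package has OneByteReader, which wraps any io.Reader so that each Read returns at most one byte. The sketch below is only an illustration: bufio.Reader again stands in for the tokenizer (an assumption), and positionsPerByte is a made-up name. In real use you would pass iotest.OneByteReader(src) to html.NewTokenizer and compute the offset from src the same way.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
	"testing/iotest"
)

// positionsPerByte wraps src in iotest.OneByteReader and records how far
// src has advanced after each byte the buffering consumer hands out.
func positionsPerByte(body string) []int {
	src := strings.NewReader(body)
	consumer := bufio.NewReader(iotest.OneByteReader(src))
	var positions []int
	for {
		if _, err := consumer.ReadByte(); err != nil {
			return positions // io.EOF once the body is exhausted
		}
		positions = append(positions, int(src.Size())-src.Len())
	}
}

func main() {
	// With one-byte reads, Size - Len advances in step with the consumer.
	fmt.Println(positionsPerByte("<a>"))
}
```

Because the wrapper never hands over more than one byte, Size - Len on the underlying strings.Reader advances one step per byte consumed instead of jumping to the end.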

With that reader, the position is fairly easy to track in the tokenizer loop:

    reader := NewReader(body)
    tokenizer := html.NewTokenizer(reader)
    // lastIdx is the reader position after the previous token,
    // i.e. roughly where the current token starts.
    lastIdx := 0
tokenLoop:
    for {
        token := tokenizer.Next()
        switch token {
        case html.ErrorToken:
            break tokenLoop
        case html.StartTagToken:
            t := tokenizer.Token()
            tagName := strings.ToLower(t.Data)
            if tagName == "form" {
                fmt.Printf("found form at %d\n", lastIdx)
                return
            }
        }
        // Remember where this token ended before moving on to the next one.
        lastIdx = int(reader.Pos())
    }

You might be able to get close to what you are trying to do (though not exactly what you want) with careful arithmetic using the Tokenizer's Buffered method, which returns the slice of bytes currently in the buffer that have not yet been tokenized. But I don't think it will give you what you want: for <div><form></form></div>, the tokenizer would probably buffer the whole string before giving you the first div token. In that case the size of the buffered content does not help in calculating the position.

Tokenizing a markup language with nested structure will almost always need to buffer the input. The private span fields are not much use either, since they are only offsets into the tokenizer's buffer, not absolute positions in the reader.

Since the html Tokenizer does not provide an API for the raw position of a tag in the original data, to get what you want I would probably just run strings.Index or bytes.Index with the token's raw bytes to find the position:

strings.Index(body, string(tokenizer.Raw()))
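One caveat: strings.Index returns the first occurrence, so if the same raw text appears earlier in the body (for example, two identical tags), the result points at the wrong one. A minimal sketch of one way around that is to search from a running offset. The helper name indexFrom is hypothetical, and in real use raw would come from string(tokenizer.Raw()):

```go
package main

import (
	"fmt"
	"strings"
)

// indexFrom finds raw in body at or after byte offset from and returns its
// absolute offset in body, or -1 if it is not found. (Hypothetical helper.)
func indexFrom(body, raw string, from int) int {
	if from < 0 || from > len(body) {
		return -1
	}
	i := strings.Index(body[from:], raw)
	if i < 0 {
		return -1
	}
	return from + i
}

func main() {
	body := "<p>a</p><p>b</p>"
	first := indexFrom(body, "<p>", 0)
	// Resume the search just past the first match to find the second tag.
	second := indexFrom(body, "<p>", first+len("<p>"))
	fmt.Println(first, second)
}
```

Advancing the offset past each token as you go keeps every lookup anchored at or after the previous token, so repeated tags resolve to distinct positions.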