I'm trying to figure out if there's a way to get the current character position of a tag using the golang.org/x/net/html
tokenizer library?
Simplified code looks like:
func LookForForm(body string) {
reader := strings.NewReader(body)
tokenizer := html.NewTokenizer(reader)
idx := 0
lastIdx := 0
for {
token := tokenizer.Next()
lastIdx = idx
idx = int(reader.Size()) - int(reader.Len())
switch token {
case html.ErrorToken:
return
case html.StartTagToken:
t := tokenizer.Token()
tagName := strings.ToLower(t.Data)
if tagName == "form" {
fmt.Printf("found at form at %d
", lastIdx)
return
}
}
}
}
This doesn't work (I think) because reader is not reading character-by-character but by chunks so my calculation of Size - Len is invalid. tokenizer
maintains two private span
structs ( https://github.com/golang/net/blob/master/html/token.go line 147) but I am unaware of how to access them.
One possible solution that just occurred to me is to make a "reader" that only reads a single character at a time so my Size
and Len
calculations are always correct. But, that seems like a hack and any suggestions would be appreciated.
A non-buffering reader ended up working ok for me. The implementation of the reader looks something like:
package rule
import (
"errors"
"io"
"unicode/utf8"
)
type Reader struct {
s string
i int64
z int64
prevRune int64 // index of the previously read rune or -1
}
func (r *Reader) String() string {
return r.s
}
func (r *Reader) Len() int {
if r.i >= r.z {
return 0
}
return int(r.z - r.i)
}
func (r *Reader) Size() int64 {
return r.z
}
func (r *Reader) Pos() int64 {
return r.i
}
func (r *Reader) Read(b []byte) (int, error) {
if r.i >= r.z {
return 0, io.EOF
}
r.prevRune = -1
b[0] = r.s[r.i]
r.i += 1
return 1, nil
}
Then the loop for the tokenizer is fairly easy to calculate:
reader := NewReader(body)
tokenizer := html.NewTokenizer(reader)
idx := 0
lastIdx := 0
tokenLoop:
for {
token := tokenizer.Next()
switch token {
case html.ErrorToken:
break tokenLoop
case html.EndTagToken, html.TextToken, html.CommentToken, html.SelfClosingTagToken:
lastIdx = int(reader.Pos())
case html.StartTagToken:
t := tokenizer.Token()
tagName := strings.ToLower(t.Data)
idx = int(reader.Pos())
if tagName == "form" {
fmt.Printf("found at form at %d
", lastIdx)
return
}
}
}
You might be able to accomplish what you are trying to do (not what you want) with careful arithmetic using Tokenizer's Buffered method which returns the slice of bytes currently in buffer that have yet been tokenized. But I don't think you will get what you wanted, as <div><form></form></div>
would probably buffer the whole string before give you the first div token. In that case the size of the buffered content is not helpful in calculating the solution.
Tokenizing mark up lang with nested structure will almost always need to buffer the input to work. the private span attribute should be quite useless as it is only a reference in it's buffer, not absolute position from the reader.
Since the html Tokenizer is not providing an API to access the raw position of a tag in the original data, to get want you wanted I probably would just do a strings.Index or bytes.Index on the raw buffer of the token to get the position:
strings.Index(body, string(tokenizer.Raw()))