I am new to Go and, as an exercise, I am building a crawler that extracts some information from a site using regular expressions. However, it seems that Go parses the web page incorrectly. I am using the following code:
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"regexp"
)

// getPages fetches url and prints every match of reg along with its first capture group.
func getPages(url string, reg *regexp.Regexp) int {
	resp, err := http.Get(url)
	if err != nil {
		log.Println(err)
		return 0
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Println(err)
		return 0
	}
	mm := reg.FindAllSubmatch(body, -1)
	fmt.Println("\n" + url + "\n")
	for _, val := range mm {
		fmt.Println(string(val[0])) // full match
		fmt.Println(string(val[1])) // captured page number as text
		fmt.Println(val[1])         // the same capture as raw bytes
	}
	return 1
}

func main() {
	url := "http://www.vinmonopolet.no/vareutvalg/sok?query=*&sort=2&sortMode=0&page=1&filterIds=25&filterValues=Rødvin"
	rr := regexp.MustCompile(`query=\*&sort=2&sortMode=0&page=\d+&filterIds=25&filterValues=\S{1,15}">(\d+)`)
	getPages(url, rr)
}
The program reads the contents of the URL, and I receive output of the form:
query=*&sort=2&sortMode=0&page=10&filterIds=25&filterValues=R">10
10
[49 48]
query=*&sort=2&sortMode=0&page=11&filterIds=25&filterValues=R">11
11
[49 49]
query=*&sort=2&sortMode=0&page=355&filterIds=25&filterValues=R">355
355
[51 53 53]
All the values except the last one are correct. Navigating to http://www.vinmonopolet.no/vareutvalg/sok?query=*&sort=2&sortMode=0&page=1&filterIds=25&filterValues=Rødvin and viewing its source shows that the last entry should have the value 205, not 355.
Can someone point me in the right direction for solving this issue?
Edit: The regex works as expected; it is not the issue. Also, if you change the url variable to, e.g.
url:="http://www.vinmonopolet.no/vareutvalg/sok?query=*&sort=2&sortMode=0&page=1&filterIds=25&filterValues=Hvitvin"
which has fewer pages (138), the program seems to work as expected.
Edit 2: I am using Go version 1.2.1 on Ubuntu 14.04, from the standard apt-get install.
Edit 3: I compiled Go 1.3 and tried that version, with the same results.