My task is to find image URLs inside an HTML page.
The problem
The HTML parser golang.org/x/net/html, as well as github.com/PuerkitoBio/goquery, ignores the biggest image on the page http://www.ozon.ru/context/detail/id/34498204/
The question
Why is the img tag with src="" ignored?
Notes:
When I used a parser written in Swift, this image was found on the page: //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg
The image tag is found when I use a regex search.
The image tag is found when I use the third-party service saveallimages.com.
I tried to use gokogiri but had no success compiling it on my Mac: go get succeeds, but go build hangs forever.
Parsed HTML page source
This is the HTML returned by resp, _ := http.Get(url)
Code:
package main

import (
	"log"
	"net/http"

	"golang.org/x/net/html"
)

func main() {
	url := "http://www.ozon.ru/context/detail/id/34498204/"
	if resp, err := http.Get(url); err == nil {
		defer resp.Body.Close()
		log.Println("Load page complete")
		if resp != nil {
			log.Println("Page response is NOT nil")
			if document, err := html.Parse(resp.Body); err == nil {
				var parser func(*html.Node)
				parser = func(n *html.Node) {
					if n.Type == html.ElementNode && n.Data == "img" {
						var imgSrcUrl, imgDataOriginal string
						for _, element := range n.Attr {
							if element.Key == "src" {
								imgSrcUrl = element.Val
							}
							if element.Key == "data-original" {
								imgDataOriginal = element.Val
							}
						}
						log.Println(imgSrcUrl, imgDataOriginal)
					}
					for c := n.FirstChild; c != nil; c = c.NextSibling {
						parser(c)
					}
				}
				parser(document)
			} else {
				log.Panicln("Parse html error", err)
			}
		} else {
			log.Println("Page response IS nil")
		}
	}
}
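This is not a bug but the expected behaviour of x/net/html, and it affects all parsers based on x/net/html (goquery included). The image sits inside a <noscript> element, and x/net/html parses the page as if scripting were enabled, so the content of <noscript> is kept as text and never becomes img nodes.
There are four possible solutions:
Remove <noscript> and </noscript> from the HTML so that x/net/html parses their content as ordinary markup. Something like: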
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

func main() {
	url := "http://www.ozon.ru/context/detail/id/34498204/"
	if resp, err := http.Get(url); err == nil {
		// The deferred Close is enough; no second explicit Close is needed.
		defer resp.Body.Close()
		log.Println("Load page complete")
		if resp != nil {
			log.Println("Page response is NOT nil")
			// --------------
			// Read the whole body and drop the noscript tags (but keep their
			// content) before handing the markup to the parser.
			// ioutil.ReadAll works on all Go versions; io.ReadAll replaces it since Go 1.16.
			data, err := ioutil.ReadAll(resp.Body)
			if err != nil {
				log.Panicln("Read body error", err)
			}
			hdata := strings.Replace(string(data), "<noscript>", "", -1)
			hdata = strings.Replace(hdata, "</noscript>", "", -1)
			// --------------
			if document, err := html.Parse(strings.NewReader(hdata)); err == nil {
				var parser func(*html.Node)
				parser = func(n *html.Node) {
					if n.Type == html.ElementNode && n.Data == "img" {
						var imgSrcUrl, imgDataOriginal string
						for _, element := range n.Attr {
							if element.Key == "src" {
								imgSrcUrl = element.Val
							}
							if element.Key == "data-original" {
								imgDataOriginal = element.Val
							}
						}
						log.Println(imgSrcUrl, imgDataOriginal)
					}
					for c := n.FirstChild; c != nil; c = c.NextSibling {
						parser(c)
					}
				}
				parser(document)
			} else {
				log.Panicln("Parse html error", err)
			}
		} else {
			log.Println("Page response IS nil")
		}
	}
}
Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec
Use gokogiri with Go 1.4 (I'm pretty sure this is the last supported version).
Wait for a decision on https://github.com/golang/go/issues/16318. If this turns out to be a real bug, I'll make a pull request.
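A note on the first solution: the plain strings.Replace calls only match the exact lowercase tags <noscript> and </noscript>, so they would miss variants such as <NOSCRIPT> or a tag carrying attributes. Below is a minimal sketch of a more tolerant variant that uses only the standard library. The helper name stripNoscriptTags and the sample markup are hypothetical, chosen for illustration, and the pattern assumes that attribute values inside the tag do not themselves contain a > character.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTag matches opening and closing noscript tags in any letter case,
// with optional whitespace or attributes before the closing '>'.
// Assumption: attribute values inside the tag contain no '>' character.
var noscriptTag = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// stripNoscriptTags is a hypothetical helper: it removes the noscript tags
// themselves and leaves whatever was between them untouched.
func stripNoscriptTags(page string) string {
	return noscriptTag.ReplaceAllString(page, "")
}

func main() {
	// Made-up sample markup, not taken from the page in the question.
	page := `<div><NoScript ><img src="//example.com/cover.jpg"></noscript ></div>`
	fmt.Println(stripNoscriptTags(page))
}
```

The cleaned string can then be passed to html.Parse through strings.NewReader, in the same way as hdata in the code above.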