HTML parser ignores img tag (Golang)

My task is to find image URLs inside an HTML page.

The problem

The HTML parser golang.org/x/net/html, as well as github.com/PuerkitoBio/goquery, ignores the biggest image on the page http://www.ozon.ru/context/detail/id/34498204/

The question

  • What is wrong with my code?
  • Why is the required img tag with src="" ignored?
  • Is there a way to get all images from HTML with Go?

Notes:

  • When I used a parser written in Swift, this image was found on the page: //static2.ozone.ru/multimedia/spare_covers/1013531536.jpg

  • This image tag is found when I use a regex search.

  • This image tag is found when I use the third-party service saveallimages.com.

  • I tried to use gokogiri but could not compile it on my Mac: go get succeeds, but go build hangs forever.
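For reference, a regex search of the kind mentioned in the notes might look like the minimal sketch below. The pattern and the findImgSrcs name are assumptions made for illustration, and the sample markup is invented. A regex is only a rough heuristic for HTML (it expects quoted attribute values), but it does not care whether the tag sits inside another element such as noscript.

```go
package main

import (
	"fmt"
	"regexp"
)

// imgSrcRe matches the quoted src attribute of an <img> tag.
// It is a heuristic, not an HTML parser: unquoted values are not handled.
var imgSrcRe = regexp.MustCompile(`(?is)<img\b[^>]*?\ssrc\s*=\s*["']([^"']*)["']`)

// findImgSrcs (hypothetical helper) returns the src value of every
// <img> tag that the pattern matches in page, in document order.
func findImgSrcs(page string) []string {
	var srcs []string
	for _, m := range imgSrcRe.FindAllStringSubmatch(page, -1) {
		srcs = append(srcs, m[1])
	}
	return srcs
}

func main() {
	// Invented sample markup: one plain image and one inside <noscript>.
	page := `<p><img src="/a.png"></p><noscript><img class="big" src='//example.com/b.jpg'></noscript>`
	for _, src := range findImgSrcs(page) {
		fmt.Println(src)
	}
}
```

The requirement for whitespace before src keeps attributes such as data-src from being mistaken for src.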

Parsed HTML page source

This is the HTML returned by resp, _ := http.Get(url).

Code:

package main

import (
  "golang.org/x/net/html"
  "log"
  "net/http"
)


func main() {

  url := "http://www.ozon.ru/context/detail/id/34498204/"

  if resp, err := http.Get(url); err == nil {
    defer resp.Body.Close()

    log.Println("Load page complete")

    if resp != nil {
      log.Println("Page response is NOT nil")

      if document, err := html.Parse(resp.Body); err == nil {

        var parser func(*html.Node)
        parser = func(n *html.Node) {
          if n.Type == html.ElementNode && n.Data == "img" {

            var imgSrcUrl, imgDataOriginal string

            for _, element := range n.Attr {
              if element.Key == "src" {
                imgSrcUrl = element.Val
              }
              if element.Key == "data-original" {
                imgDataOriginal = element.Val
              }
            }

            log.Println(imgSrcUrl, imgDataOriginal)
          }

          for c := n.FirstChild; c != nil; c = c.NextSibling {
            parser(c)
          }

        }
        parser(document)
      } else {
        log.Panicln("Parse html error", err)
      }

    } else {
      log.Println("Page response IS nil")
    }
  } else {
    log.Panicln("Load page error", err)
  }

}

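This is not a bug but the expected behaviour of x/net/html: the image is inside a <noscript> element, and the parser, like a browser with scripting enabled, treats the content of <noscript> as raw text, so no img node is created for it. This affects every parser built on x/net/html, including goquery.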

There are four possible solutions:

  1. Remove <noscript> and </noscript> from the HTML so that x/net/html parses their content as ordinary markup. Something like:

    package main
    
    import (
        "golang.org/x/net/html"
        "log"
        "net/http"
        "io/ioutil"
        "strings"
    )
    
    func main() {
    
        url := "http://www.ozon.ru/context/detail/id/34498204/"
    
        if resp, err := http.Get(url); err == nil {
            defer resp.Body.Close()
    
            log.Println("Load page complete")
    
            if resp != nil {
                log.Println("Page response is NOT nil")
                // --------------
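                // Read the whole body; the deferred Close above releases it.
                data, err := ioutil.ReadAll(resp.Body)
                if err != nil {
                    log.Panicln("Read body error", err)
                }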
    
                hdata := strings.Replace(string(data), "<noscript>", "", -1)
                hdata = strings.Replace(hdata, "</noscript>", "", -1)
                // --------------
    
                if document, err := html.Parse(strings.NewReader(hdata)); err == nil {
                    var parser func(*html.Node)
                    parser = func(n *html.Node) {
                        if n.Type == html.ElementNode && n.Data == "img" {
    
                            var imgSrcUrl, imgDataOriginal string
    
                            for _, element := range n.Attr {
                                if element.Key == "src" {
                                    imgSrcUrl = element.Val
                                }
                                if element.Key == "data-original" {
                                    imgDataOriginal = element.Val
                                }
                            }
    
                            log.Println(imgSrcUrl, imgDataOriginal)
                        }
    
                        for c := n.FirstChild; c != nil; c = c.NextSibling {
                            parser(c)
                        }
    
                    }
                    parser(document)
                } else {
                    log.Panicln("Parse html error", err)
                }
    
            } else {
                log.Println("Page response IS nil")
            }
        } else {
            log.Panicln("Load page error", err)
        }
    
    }
    
  2. Patch x/net/html with https://github.com/bearburger/net/commit/42ac75393ced8c48137b574278522df1f3fa2cec

  3. Use gokogiri with Go 1.4 (I'm pretty sure this is the last supported version).

  4. Wait for a decision on https://github.com/golang/go/issues/16318. If this is a real bug, I'll make a pull request.
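A note on solution 1: the exact-match strings.Replace calls miss variants such as <NOSCRIPT> or <noscript >. A more tolerant variant might look like the minimal sketch below, which uses only the standard library. The stripNoscriptTags name and the sample markup are assumptions for illustration; this is not part of x/net/html.

```go
package main

import (
	"fmt"
	"regexp"
)

// noscriptTagRe matches opening and closing noscript tags in any letter
// case, with optional whitespace or attributes before the closing ">".
var noscriptTagRe = regexp.MustCompile(`(?i)</?noscript\b[^>]*>`)

// stripNoscriptTags (hypothetical helper) removes the noscript tags
// themselves and keeps whatever markup was between them.
func stripNoscriptTags(page string) string {
	return noscriptTagRe.ReplaceAllString(page, "")
}

func main() {
	// Invented sample markup with mixed-case tags.
	page := `<div><NoScript ><img src="big.jpg"></NOSCRIPT></div>`
	fmt.Println(stripNoscriptTags(page))
	// prints: <div><img src="big.jpg"></div>
}
```

The result can be passed to html.Parse(strings.NewReader(...)) exactly as in solution 1. Newer releases of golang.org/x/net/html also provide html.ParseWithOptions together with html.ParseOptionEnableScripting(false), which should make this preprocessing unnecessary; whether it is available depends on the installed version.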