I'm parsing XML RSS feeds from a couple of different sources, and I want to find the images in the HTML they contain.
I did some research and found a regex that I think might work:
/<img[^>]+src="?([^"\s]+)"?\s*\/>/g
but I'm having trouble using it in Go. I get errors because I don't know how to make Go search with that expression.
I tried passing it as a string, but it doesn't escape properly with either single or double quotes. I also tried using it bare, exactly as written above, and that gives me an error too.
Any ideas?
Using a proper HTML parser is always better for parsing HTML. However, a cheap, hackish regex can also work fine. Here's an example:
// imgRE captures the src value of an img tag, quoted with either " or '.
var imgRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// If your img tags are always well formed with double quotes, use this
// instead; it's more efficient.
// var imgRE = regexp.MustCompile(`<img[^>]+\bsrc="([^"]+)"`)

// findImages returns the src value of every img tag found in htm.
func findImages(htm string) []string {
	imgs := imgRE.FindAllStringSubmatch(htm, -1)
	out := make([]string, len(imgs))
	for i := range out {
		out[i] = imgs[i][1] // index 1 is the first capture group
	}
	return out
}
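For what it's worth, the pattern from the question should also compile in Go once the JavaScript-style /.../g delimiters are dropped and it is written as a raw string literal (backticks), so neither the quotes nor the backslashes need escaping. A minimal sketch; the names selfClosingImgRE and selfClosingSrcs are made up for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// selfClosingImgRE is the pattern from the question without the /.../g
// delimiters. Go has no regex literals, so the pattern is passed as a raw
// string; FindAll* methods take the place of the "g" flag.
var selfClosingImgRE = regexp.MustCompile(`<img[^>]+src="?([^"\s]+)"?\s*/>`)

// selfClosingSrcs (hypothetical helper) returns the captured src of each match.
func selfClosingSrcs(htm string) []string {
	var out []string
	for _, m := range selfClosingImgRE.FindAllStringSubmatch(htm, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// Only the first tag ends in "/>", so only its src is reported.
	fmt.Println(selfClosingSrcs(`<img src="a.jpg" /><img src="b.jpg">`))
}
```

Note that this pattern only matches self-closing tags, so the second tag in the sample is skipped. The pattern above does not have that restriction.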
Sorry, I haven't worked with Go before, but this seems to work. I tried it at https://tour.golang.org/welcome/1:
package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Sample input with one single-quoted and one double-quoted src attribute.
	myString := `<img src='img1single.jpg'><img src="img2double.jpg">`
	myRegex := regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)
	// In each match, index 0 is the whole match and index 1 is the captured src.
	for _, imgTag := range myRegex.FindAllStringSubmatch(myString, -1) {
		fmt.Println(imgTag[1])
	}
}
I suggest using HtmlAgilityPack to parse any DOM/XML-like content.
Load the document with:
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);
Then query it with an XPath expression. A regex is fine, but capture groups and similar issues make the job more complex.
doc.DocumentNode.SelectSingleNode(XPath here)
or
doc.DocumentNode.SelectNodes("//img") // this should give all img tags
I suggest this because it seems the RSS feed serves some HTML content. So fetch the XML, parse it with XMLDoc, pull out the HTML content you need, and then get all the images this way. I'm leaving this here as a general answer.
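Since the question is about Go, the same two-step flow (parse the XML first, then look for images in the embedded HTML) could be sketched with the standard encoding/xml package. This is only a sketch under assumptions: it assumes an RSS 2.0 layout with the HTML in each item's description element, and rssFeed, rssItem, imagesFromFeed and sampleFeed are illustrative names, not part of any library. The regex is the one from the Go answer above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"regexp"
)

// rssFeed mirrors only the parts of an RSS 2.0 document used here
// (assumed layout: rss > channel > item > description).
type rssFeed struct {
	Items []rssItem `xml:"channel>item"`
}

type rssItem struct {
	Title       string `xml:"title"`
	Description string `xml:"description"` // CDATA and entities are decoded
}

var imgSrcRE = regexp.MustCompile(`<img[^>]+\bsrc=["']([^"']+)["']`)

// imagesFromFeed (hypothetical helper) returns every img src found in the
// item descriptions, in document order.
func imagesFromFeed(data []byte) ([]string, error) {
	var feed rssFeed
	if err := xml.Unmarshal(data, &feed); err != nil {
		return nil, err
	}
	var urls []string
	for _, item := range feed.Items {
		for _, m := range imgSrcRE.FindAllStringSubmatch(item.Description, -1) {
			urls = append(urls, m[1])
		}
	}
	return urls, nil
}

// sampleFeed is made-up test data: one CDATA item and one entity-escaped item.
const sampleFeed = `<rss version="2.0"><channel>
<item><title>First</title><description><![CDATA[<p><img src="a.jpg"></p>]]></description></item>
<item><title>Second</title><description>&lt;img src='b.png'&gt;</description></item>
</channel></rss>`

func main() {
	urls, err := imagesFromFeed([]byte(sampleFeed))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(urls)
}
```

Feeds that put their HTML in a different element (content:encoded, for example) would need an extra struct field, and feeds declared in a non-UTF-8 encoding would need an xml.Decoder with a CharsetReader instead of xml.Unmarshal.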
After your comment, I think you just need a regex. My pattern is
<img.+?src=[\"'](.+?)[\"'].*?>
for the input
<img src='img1single.jpg'>
<img src="img2double.jpg">
and the result looks fine. In .NET, iterate over the matches with foreach and read each URL from
.Groups[1].Value
Regards.