I'm trying to parse this HTML with the help of goquery. I can't figure out how to extract the string "The string I need" while throwing away everything else.
<div class="outter-class">
<h1 class="inner-class">
The string I need
<span class="other-class" >Some value I don't need</span>
<span class="other-class2" title="sometitle"></span>
</h1>
<div class="other-class3">
<h3>Some heading i don't need</h3>
</div>
</div>
I tried to use something like https://stackoverflow.com/a/8851526/989919, adapting it to goquery like this:
test := s.Clone().Children().Empty().End().Text()
fmt.Println(test) // test is already a string, so there is no Text() method to call on it
But that doesn't work. I tried a lot of different variants from the API but I can't figure it out.
The way I got it to work was with:
// End() lets us jump back to the h1 selection to get its text
text := doc.Find("h1").Children().Remove().End().Text()
text = strings.TrimSpace(text)
fmt.Println(text)
Output:
The string I need
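The code removes the child elements (the span elements) from the h1 element, so only the h1's own text remains. Note that Remove() modifies the parsed document, so the spans are no longer available to later queries. There may be an easier way of doing this, but it works. :)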
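How about the following? It uses Contents() rather than Children(), because Children() returns only element nodes and would pick the first span instead of the text. The result still includes the surrounding whitespace, so you may want to trim it:
doc.Find(".outter-class .inner-class").Contents().First().Text()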
Best way I've found to accomplish this:
text := dom.Find(".inner-class").Nodes[0].FirstChild.Data
I've spent quite a bit of time using the HTML parsing library under goquery so this doesn't really seem hacky to me, but it might to some.