goquery - Extract text from one HTML tag and add it to the next tag

Yeah, sorry that the title explains nothing. I'll need to use an example.

This is a continuation of another question I posted which solved one problem but not all of them. I've put most of the background info from that question into this one. Also, I've only been looking into Go for about 5 days (and I only started learning code a couple months ago), so I'm 90% sure that I'm close to figuring out what I want and that the problem is that I've got some silly syntax mistakes.

Situation

I'm trying to use goquery to parse a webpage. (Eventually I want to put some of the data in a database). Here's what it looks like:

<html>
    <body>
        <h1>
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1>
            <span class="text">debugger </span>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

Objective

I'd like to:

  1. Extract the content of the <span class="text"> inside each <h1>.
  2. Insert this extracted content at the start of the <span class="text"> inside the <p>, concatenating the two.
  3. Only do this for the <p> tag that immediately follows the <h1> tag.
  4. Do this for all of the <h1> tags on the page.

Once again, an example explains ^this better. This is what I want it to look like:

<html>
    <body>
        <p>
            <span class="text">Go totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <p>
            <span class="text">debugger should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

Solution Attempts

Because further distinguishing the <h1> tags from the <p> tags would provide more parsing options, I've figured out how to change the class attributes of the <h1> tags to this:

<html>
    <body>
        <h1 class="title">
            <span class="text">Go </span>
        </h1>
        <p>
            <span class="text">totally </span>
            <span class="post">kicks </span>
        </p>
        <p>
            <span class="text">hacks </span>
            <span class="post">its </span>
        </p>
        <h1 class="title">
            <span class="text">debugger </span>
        </h1>
        </h1>
        <p>
            <span class="text">should </span>
            <span class="post">be </span>
        </p>
        <p>
            <span class="text">called </span>
            <span class="post">ogle </span>
        </p>
        <h3>
            <span class="statement">true</span>
        </h3>
    </body>
</html>

with this code:

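// code_example_above stands for the HTML shown in the first example.
htmlCode := strings.NewReader(`
code_example_above
`)
doc, err := goquery.NewDocumentFromReader(htmlCode)
if err != nil {
    log.Fatal(err)
}
doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    // SetAttr puts the class on the <h1> element itself.
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
    }
})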

I know that I can select the <p> (with its <span class="text">) that follows each <h1 class="title"> with either doc.Find("h1+p") or s.Next() inside the doc.Find("h1").Each function:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        fmt.Println(class, s.Text())
        fmt.Println(s.Next().Text())
    }
})

I can't figure out how to insert the text from the <h1 class="title"> into the <span class="text"> of the following <p>. I've tried quite a few variations of s.After(), s.Before(), and s.Append(), e.g., like this:

doc.Find("h1").Each(func(i int, s *goquery.Selection) {
    s.SetAttr("class", "title")
    class, _ := s.Attr("class")
    if class == "title" {
        s.After(s.Text())
        fmt.Println(s.Next().Text())
    }
})

but I can't figure out how to do exactly what I want.

If I use s.After(s.Next().Text()) instead, I get this error output:

panic: expected identifier, found 5 instead

goroutine 1 [running]:
code.google.com/p/cascadia.MustCompile(0xc2082f09a0, 0x62, 0x62)
    /home/*/go/src/code.google.com/p/cascadia/selector.go:59 +0x77
github.com/PuerkitoBio/goquery.(*Selection).After(0xc2082ea630, 0xc2082f09a0, 0x62, 0x5)
    /home/*/go/src/github.com/PuerkitoBio/goquery/manipulation.go:18 +0x32
main.func·001(0x0, 0xc2082ea630)
    /home/*/go/test2.go:78 +0x106
github.com/PuerkitoBio/goquery.(*Selection).Each(0xc2082ea600, 0x7cb678, 0x2)
    /home/*/go/src/github.com/PuerkitoBio/goquery/iteration.go:7 +0x173
main.ExampleScrape()
    /home/*/go/test2.go:82 +0x213
main.main()
    /home/*/go/test2.go:175 +0x1b

goroutine 9 [runnable]:
net/http.(*persistConn).readLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:928 +0x9ce
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:660 +0xc9f

goroutine 17 [syscall, locked to thread]:
runtime.goexit()
    /usr/lib/go/src/runtime/asm_amd64.s:2232 +0x1

goroutine 10 [select]:
net/http.(*persistConn).writeLoop(0xc208047ef0)
    /usr/lib/go/src/net/http/transport.go:945 +0x41d
created by net/http.(*Transport).dialConn
    /usr/lib/go/src/net/http/transport.go:661 +0xcbc
exit status 2

(The line numbers of my script don't match the examples above; the call s.After(s.Next().Text()) is the one the trace reports at test2.go:78. I don't know what exactly "panic: expected identifier, found 5 instead" means.)

Summary

In summary, my problem is that I can't quite wrap my head around how to use goquery to add text to a tag.

I think I'm close. Would any gopher Jedis be able and willing to help this padawan?

Something like the code below does the job. It finds all <h1> nodes, then all <span> nodes inside them, looking for one with class text. It then takes the element that follows the <h1>; if that is a <p> whose first child is a <span> with class text, it replaces that <span> with a new <span> containing the combined text and removes the <h1>.

I wonder if it's possible to create nodes using goquery without writing HTML...

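package main

import (
    "fmt"
    "html"
    "log"
    "strings"

    "github.com/PuerkitoBio/goquery"
)

// htmlCode holds the page from the question (elided here).
var htmlCode = `<html>
...
</html>`

func main() {
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(htmlCode))
    if err != nil {
        log.Fatal(err)
    }
    doc.Find("h1").Each(func(i int, h1 *goquery.Selection) {
        h1.Find("span").Each(func(j int, s *goquery.Selection) {
            if !s.HasClass("text") {
                return
            }
            // Next never returns nil, so test what it matched instead.
            p := h1.Next()
            if !p.Is("p") {
                return
            }
            ps := p.Children().First()
            if !ps.HasClass("text") {
                return
            }
            // Text returns unescaped text, so escape it before building markup.
            merged := html.EscapeString(s.Text() + ps.Text())
            ps.ReplaceWithHtml(fmt.Sprintf("<span class=\"text\">%s</span>", merged))
            h1.Remove()
        })
    })
    htmlResult, err := doc.Html()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(htmlResult)
}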