I am trying to use gocolly's Parallelism setting to throttle scraping a maximum number of URLs at a time.
Using the code I've pasted below, I am getting this output:
Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=sQuKLv
This shows that the visits are not being limited to the given number of parallel requests: when I add more URLs, they are all sent at once, which results in a ban from the server.
How can I configure the library to get the following output:
Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=sQuKLv
Here is the code:
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/gocolly/colly"
)

const (
	letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	URL         = "https://www.google.com/search?q="
)

// RandStringBytes returns a channel that yields five random
// n-character query strings, then closes.
func RandStringBytes(n int) chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for i := 1; i <= 5; i++ {
			b := make([]byte, n)
			for j := range b {
				b[j] = letterBytes[rand.Intn(len(letterBytes))]
			}
			out <- string(b)
		}
	}()
	return out
}
func main() {
	c := RandStringBytes(6)
	collector := colly.NewCollector(
		colly.AllowedDomains("www.google.com"),
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
	)
	collector.Limit(&colly.LimitRule{
		DomainRegexp: "www.google.com",
		Parallelism:  2,
		RandomDelay:  5 * time.Second,
	})
	collector.OnResponse(func(r *colly.Response) {
		url := r.Ctx.Get("url")
		fmt.Println("Done visiting", url)
	})
	collector.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
		fmt.Println("Visiting", r.URL.String())
	})
	collector.OnError(func(r *colly.Response, err error) {
		fmt.Println(err)
	})
	for w := range c {
		collector.Visit(URL + w)
	}
	collector.Wait()
}
OnRequest runs before the request is actually sent to the server, so your debug statement is misleading: fmt.Println("Visiting", r.URL.String()) should probably be fmt.Println("Preparing request for:", r.URL.String()).
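For intuition, the interleaving the question asks for is what you get when the "Visiting" line is printed only after a worker slot has been acquired. Here is a plain-Go sketch of what Parallelism: 2 amounts to, using a two-slot semaphore (an illustration of the idea, not colly's internals):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	urls := []string{"a", "b", "c", "d", "e"}
	sem := make(chan struct{}, 2) // capacity = Parallelism
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{} // block until one of the 2 slots is free
			fmt.Println("Visiting", u)
			time.Sleep(100 * time.Millisecond) // simulated fetch
			fmt.Println("Done visiting", u)
			<-sem // release the slot
		}(u)
	}
	wg.Wait()
}
```

Because the print happens after the slot is acquired, at most two "Visiting" lines can appear without a matching "Done visiting". colly's OnRequest fires earlier, at request creation, which is why its output does not reflect the throttling.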
I thought your question was interesting, so I set up a local test case with Python's http.server, like so:
$ cd $(mktemp -d) # make temp dir
$ for n in {0..99}; do touch $n; done # make 100 empty files
$ python3 -m http.server # start up test server
Then modify your code above:
package main

import (
	"fmt"
	"strconv"
	"time"

	"github.com/gocolly/colly"
)

const URL = "http://127.0.0.1:8000/"

func main() {
	collector := colly.NewCollector(
		colly.AllowedDomains("127.0.0.1:8000"),
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
	)
	collector.Limit(&colly.LimitRule{
		DomainRegexp: "127.0.0.1:8000",
		Parallelism:  2,
		Delay:        5 * time.Second,
	})
	collector.OnResponse(func(r *colly.Response) {
		url := r.Ctx.Get("url")
		fmt.Println("Done visiting", url)
	})
	collector.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
		fmt.Println("Creating request for:", r.URL.String())
	})
	collector.OnError(func(r *colly.Response, err error) {
		fmt.Println(err)
	})
	for i := 0; i < 100; i++ {
		collector.Visit(URL + strconv.Itoa(i))
	}
	collector.Wait()
}
Note that I changed the RandomDelay to a regular Delay, which makes the test case easier to reason about, and I changed the debug statement in OnRequest.
Now if you go run this file, you'll see that:
1. The terminal prints Creating request for: http://127.0.0.1:8000/ + a number, 100 times.
2. It then prints Done visiting http://127.0.0.1:8000/ + a number, two at a time roughly every 5 seconds.
3. The Python server logs GET requests, 1 for each of the numbers in #2.
So it looks to me like colly is behaving as intended. If you're still getting rate limit errors that you don't expect, consider verifying that your limit rule actually matches the domain.
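One quick way to check that, without colly in the loop, is to run the rule's DomainRegexp against the host of a URL you visit: a regexp that doesn't match the host means the rule, and with it the Parallelism and delay settings, never applies. A stdlib-only sketch (checkRule is my own helper, and I'm assuming colly matches the regexp against the request's host):

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// checkRule (my own helper, not part of colly) reports whether a
// LimitRule-style DomainRegexp matches the host of a URL. If it
// doesn't, the rule's throttling would never kick in.
func checkRule(domainRegexp, rawURL string) (bool, error) {
	re, err := regexp.Compile(domainRegexp)
	if err != nil {
		return false, err // an invalid regexp also disables the rule
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return re.MatchString(u.Host), nil
}

func main() {
	ok, _ := checkRule("www.google.com", "https://www.google.com/search?q=GrkZmM")
	fmt.Println(ok) // true: the limit rule would apply
	ok, _ = checkRule("www.google.com", "https://google.com/search?q=GrkZmM")
	fmt.Println(ok) // false: host mismatch, so no throttling at all
}
```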