I am trying to use gocolly's Parallelism setting to throttle scraping a maximum number of URLs at a time.
Using the code I've pasted below, I am getting this output:
Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=sQuKLv
This shows that the visits are not being limited to the given number of parallel requests: when I add more URLs, they are all sent at once, which results in a ban from the server.
How can I configure the library to get the following output:
Visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=MtYvWU
Done visiting https://www.google.com/search?q=GrkZmM
Visiting https://www.google.com/search?q=MtYvWU
Visiting https://www.google.com/search?q=yMDfIa
Done visiting https://www.google.com/search?q=eYSGmF
Done visiting https://www.google.com/search?q=yMDfIa
Visiting https://www.google.com/search?q=sQuKLv
Done visiting https://www.google.com/search?q=sQuKLv
Here is the code:
package main

import (
	"fmt"
	"math/rand"
	"time"

	"github.com/gocolly/colly"
)

const (
	letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
	URL         = "https://www.google.com/search?q="
)

// RandStringBytes returns a channel that yields five random
// n-character query strings, then closes.
func RandStringBytes(n int) chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		for i := 1; i <= 5; i++ {
			b := make([]byte, n)
			for j := range b {
				b[j] = letterBytes[rand.Intn(len(letterBytes))]
			}
			out <- string(b)
		}
	}()
	return out
}
func main() {
	c := RandStringBytes(6)
	collector := colly.NewCollector(
		colly.AllowedDomains("www.google.com"),
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
	)
	collector.Limit(&colly.LimitRule{
		DomainRegexp: "www.google.com",
		Parallelism:  2,
		RandomDelay:  5 * time.Second,
	})
	collector.OnResponse(func(r *colly.Response) {
		url := r.Ctx.Get("url")
		fmt.Println("Done visiting", url)
	})
	collector.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
		fmt.Println("Visiting", r.URL.String())
	})
	collector.OnError(func(r *colly.Response, err error) {
		fmt.Println(err)
	})
	for w := range c {
		collector.Visit(URL + w)
	}
	collector.Wait()
}
OnRequest runs before the request is actually sent to the server, so your debug statement is misleading: fmt.Println("Visiting", r.URL.String()) should probably be fmt.Println("Preparing request for:", r.URL.String()).
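For intuition, the interleaving the question asks for is what you get when the "Visiting" line is printed only after a worker slot has been acquired. Here is a plain-Go sketch of what Parallelism: 2 amounts to, using a two-slot semaphore (an illustration of the idea, not colly's internals):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	urls := []string{"a", "b", "c", "d", "e"}
	sem := make(chan struct{}, 2) // capacity = Parallelism
	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			sem <- struct{}{} // block until one of the 2 slots is free
			fmt.Println("Visiting", u)
			time.Sleep(100 * time.Millisecond) // simulated fetch
			fmt.Println("Done visiting", u)
			<-sem // release the slot
		}(u)
	}
	wg.Wait()
}
```

Because the print happens after the slot is acquired, at most two "Visiting" lines can appear without a matching "Done visiting". colly's OnRequest fires earlier, at request creation, which is why its output does not reflect the throttling.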
I thought your question was interesting, so I set up a local test case with Python's http.server, like so:
$ cd $(mktemp -d) # make temp dir
$ for n in {0..99}; do touch $n; done # make 100 empty files
$ python3 -m http.server # start up test server
Then modify your code above:
package main

import (
	"fmt"
	"strconv"
	"time"

	"github.com/gocolly/colly"
)

const URL = "http://127.0.0.1:8000/"

func main() {
	collector := colly.NewCollector(
		colly.AllowedDomains("127.0.0.1:8000"),
		colly.Async(true),
		colly.UserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"),
	)
	collector.Limit(&colly.LimitRule{
		DomainRegexp: "127.0.0.1:8000",
		Parallelism:  2,
		Delay:        5 * time.Second,
	})
	collector.OnResponse(func(r *colly.Response) {
		url := r.Ctx.Get("url")
		fmt.Println("Done visiting", url)
	})
	collector.OnRequest(func(r *colly.Request) {
		r.Ctx.Put("url", r.URL.String())
		fmt.Println("Creating request for:", r.URL.String())
	})
	collector.OnError(func(r *colly.Response, err error) {
		fmt.Println(err)
	})
	for i := 0; i < 100; i++ {
		collector.Visit(URL + strconv.Itoa(i))
	}
	collector.Wait()
}
Note that I changed the RandomDelay to a regular Delay, which makes the test case easier to reason about, and I changed the debug statement in OnRequest.
Now if you go run this file, you'll see that:
1. The terminal prints Creating request for: http://127.0.0.1:8000/ + a number, 100 times.
2. It then prints Done visiting http://127.0.0.1:8000/ + a number, two at a time roughly every 5 seconds.
3. The Python server logs GET requests, 1 for each of the numbers in #2.
So it looks to me like colly is behaving as intended. If you're still getting rate limit errors that you don't expect, consider verifying that your limit rule actually matches the domain.
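One quick way to check that, without colly in the loop, is to run the rule's DomainRegexp against the host of a URL you visit: a regexp that doesn't match the host means the rule, and with it the Parallelism and delay settings, never applies. A stdlib-only sketch (checkRule is my own helper, and I'm assuming colly matches the regexp against the request's host):

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// checkRule (my own helper, not part of colly) reports whether a
// LimitRule-style DomainRegexp matches the host of a URL. If it
// doesn't, the rule's throttling would never kick in.
func checkRule(domainRegexp, rawURL string) (bool, error) {
	re, err := regexp.Compile(domainRegexp)
	if err != nil {
		return false, err // an invalid regexp also disables the rule
	}
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	return re.MatchString(u.Host), nil
}

func main() {
	ok, _ := checkRule("www.google.com", "https://www.google.com/search?q=GrkZmM")
	fmt.Println(ok) // true: the limit rule would apply
	ok, _ = checkRule("www.google.com", "https://google.com/search?q=GrkZmM")
	fmt.Println(ok) // false: host mismatch, so no throttling at all
}
```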