I'm building a crawler that takes a URL, extracts the links from it, and visits each of them down to a certain depth, building a tree of paths for a specific site.
The way I implemented parallelism for this crawler is to visit each newly found URL as soon as it is discovered, like this:
func main() {
	link := "https://example.com"
	wg := new(sync.WaitGroup)
	wg.Add(1)

	q := make(chan string)
	go deduplicate(q, wg)
	q <- link
	wg.Wait()
}

func deduplicate(ch chan string, wg *sync.WaitGroup) {
	for link := range ch {
		// seen is a global variable that holds all seen URLs
		if seen[link] {
			wg.Done()
			continue
		}
		seen[link] = true
		go crawl(link, ch, wg)
	}
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
	// handle the link and create a variable "links" containing the links found inside the page
	wg.Add(len(links))
	for _, l := range links {
		q <- l
	}
}
This works fine for relatively small sites, but when I run it on a large one with a lot of links everywhere, I start getting one of these two errors on some requests: `socket: too many open files` and `no such host` (the host is indeed there).

What's the best way to handle this? Should I check for these errors and pause execution for a while until the other requests finish? Or should I specify a maximum number of requests that can be in flight at a time? (That makes more sense to me, but I'm not sure exactly how to code it up.)
The "files" referred to in the error `socket: too many open files` include threads and sockets (the HTTP requests used to load the web pages being scraped). See this question.

The DNS query also most likely fails because it is unable to create a file, but the error that gets reported is `no such host`.
The problem can be fixed in two ways:

1) Increase the maximum number of open file handles
2) Limit the maximum number of concurrent `crawl` calls

1) Is the simplest solution, but it might not be ideal because it only postpones the problem until you find a website that has more links than the new limit. On Linux you can set this limit with `ulimit -n`.
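If you do go with 1), you can also raise the soft limit from inside the program itself instead of relying on the shell. A minimal sketch, assuming Linux and the standard `syscall` package:

package main

import (
	"fmt"
	"syscall"
)

func main() {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open file limit: soft=%d hard=%d\n", rl.Cur, rl.Max)

	// Raise the soft limit up to the hard limit; going beyond the hard
	// limit would require elevated privileges.
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
}

Even then, this only buys you headroom up to the hard limit; it does not remove the underlying problem.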
2) Is more of a design problem. We need to limit the number of HTTP requests that can be made concurrently. I have modified the code a little. The most important change is `maxGoRoutines`: every scraping call that starts inserts a value into the channel, and once the channel is full the next call blocks until a value is removed. A value is removed from the channel every time a scraping call finishes.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	link := "https://example.com"
	wg := new(sync.WaitGroup)
	// The WaitGroup counts links that are queued or still being crawled;
	// Wait returns once every link has been crawled or skipped as a duplicate.
	wg.Add(1)

	q := make(chan string)
	go deduplicate(q, wg)
	q <- link
	fmt.Println("waiting")
	wg.Wait()
}

// MaxCount is the maximum number of concurrent scraping calls running.
var MaxCount = 100
var maxGoRoutines = make(chan struct{}, MaxCount)

func deduplicate(ch chan string, wg *sync.WaitGroup) {
	// seen holds all URLs that have already been queued for crawling.
	seen := make(map[string]bool)
	for link := range ch {
		if seen[link] {
			wg.Done()
			continue
		}
		seen[link] = true
		go crawl(link, ch, wg)
	}
}

func crawl(link string, q chan string, wg *sync.WaitGroup) {
	// This lets us know when all the requests are done, so that we can quit.
	defer wg.Done()

	links := doCrawl(link)
	for _, l := range links {
		// Account for each newly found link before queueing it.
		wg.Add(1)
		q <- l
	}
}

func doCrawl(link string) []string {
	// This limits the maximum number of concurrent scraping requests.
	maxGoRoutines <- struct{}{}
	defer func() { <-maxGoRoutines }()

	// Handle the link and create a variable "links" containing the links found inside the page.
	time.Sleep(time.Second)
	return []string{link + "a", link + "b"}
}
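The `doCrawl` above is only a stub. For reference, here is a rough sketch of what it could look like with real requests: one shared `http.Client`, response bodies always closed (leaking them also eats file descriptors), and link extraction with `golang.org/x/net/html`. It drops into the program above in place of the stub and needs the extra imports `log`, `net/http`, `net/url` and `golang.org/x/net/html`; treat it as a starting point rather than a finished crawler.

var client = &http.Client{Timeout: 10 * time.Second}

func doCrawl(link string) []string {
	// This limits the maximum number of concurrent scraping requests.
	maxGoRoutines <- struct{}{}
	defer func() { <-maxGoRoutines }()

	resp, err := client.Get(link)
	if err != nil {
		log.Printf("fetch %s: %v", link, err)
		return nil
	}
	// Always close the body, otherwise the connection (and its file descriptor) stays open.
	defer resp.Body.Close()

	base, err := url.Parse(link)
	if err != nil {
		return nil
	}

	var links []string
	z := html.NewTokenizer(resp.Body)
	for {
		switch z.Next() {
		case html.ErrorToken:
			// io.EOF or a parse error: return whatever was found so far.
			return links
		case html.StartTagToken, html.SelfClosingTagToken:
			t := z.Token()
			if t.Data != "a" {
				continue
			}
			for _, a := range t.Attr {
				if a.Key == "href" {
					// Resolve relative links against the page URL.
					if u, err := base.Parse(a.Val); err == nil {
						links = append(links, u.String())
					}
				}
			}
		}
	}
}

You would probably also want to filter the resolved links so you only follow URLs on the same host, and enforce your maximum depth, but that is independent of the concurrency limit.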