I am going through the Go tour and working on the final exercise, which asks you to change a web crawler to crawl in parallel without repeating a crawl (http://tour.golang.org/#73). All I have changed is the Crawl function.
var used = make(map[string]bool)

func Crawl(url string, depth int, fetcher Fetcher) {
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        if !used[u] {
            used[u] = true
            Crawl(u, depth-1, fetcher)
        }
    }
}
In order to make it concurrent I added the go keyword in front of the recursive call to Crawl, but instead of crawling all the pages the program only finds the "http://golang.org/" page and no other pages.
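That is, the only change was to launch each recursive call in its own goroutine, something like this:

for _, u := range urls {
    if !used[u] {
        used[u] = true
        go Crawl(u, depth-1, fetcher) // runs concurrently; nothing waits for it to finish
    }
}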
Why doesn't the program work when I add the go keyword to the call to Crawl?
The problem seems to be that your process exits before the crawler can follow all the URLs: because of the concurrency, main() returns before the workers are finished. To work around this, you can use sync.WaitGroup:
func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done() // mark this call as finished when it returns
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        if !used[u] {
            used[u] = true
            wg.Add(1) // register the goroutine before starting it
            go Crawl(u, depth-1, fetcher, wg)
        }
    }
}
And call Crawl in main as follows:
func main() {
    wg := &sync.WaitGroup{}
    wg.Add(1) // matches the wg.Done() deferred by the initial Crawl call
    Crawl("http://golang.org/", 4, fetcher, wg)
    wg.Wait()
}
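Note that this version still reads and writes the shared used map from several goroutines without any synchronization, so it technically has a data race; the answer below makes the map access thread safe with a mutex.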
Here's an approach, again using sync.WaitGroup, but wrapping the fetch in an anonymous goroutine. To make the url map thread safe (so that parallel goroutines can't read and change values at the same time), wrap it in a new type that includes a sync.Mutex, i.e. the fetchedUrls type in my example, and call its Lock and Unlock methods while the map is being searched/updated. Since the struct contains a mutex, it has to be passed by pointer; copying it would give each call its own mutex and defeat the locking.
type fetchedUrls struct {
    urls map[string]bool
    mux  sync.Mutex
}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, used *fetchedUrls, wg *sync.WaitGroup) {
    if depth <= 0 {
        return
    }
    used.mux.Lock()
    if !used.urls[url] {
        used.urls[url] = true
        wg.Add(1)
        go func() {
            defer wg.Done()
            body, urls, err := fetcher.Fetch(url)
            if err != nil {
                fmt.Println(err)
                return
            }
            fmt.Printf("found: %s %q\n", url, body)
            for _, u := range urls {
                Crawl(u, depth-1, fetcher, used, wg)
            }
        }()
    }
    used.mux.Unlock()
}
func main() {
    wg := &sync.WaitGroup{}
    used := &fetchedUrls{urls: make(map[string]bool)}
    Crawl("https://golang.org/", 4, fetcher, used, wg)
    wg.Wait()
}
Output:
found: https://golang.org/ "The Go Programming Language"
not found: https://golang.org/cmd/
found: https://golang.org/pkg/ "Packages"
found: https://golang.org/pkg/os/ "Package os"
found: https://golang.org/pkg/fmt/ "Package fmt"
Program exited.
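(The exact order of the found lines can differ between runs, since the fetches happen in concurrently scheduled goroutines.)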