I've just started learning Go and have been working through the tour. The last exercise is to edit a web crawler to crawl in parallel and without repeats.
Here is the link to the exercise: http://tour.golang.org/#70
Here is the code. I only changed the Crawl and main functions, so I'll just post those to keep it neat.
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
var used = make(map[string]bool)
var urlchan = make(chan string)

func Crawl(url string, depth int, fetcher Fetcher) {
    // TODO: Fetch URLs in parallel.
    // Done: Don't fetch the same URL twice.
    done := make(chan bool)
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\nfound: %s %q\n", url, body)
    go func() {
        for _, i := range urls {
            urlchan <- i
        }
        done <- true
    }()
    for u := range urlchan {
        if used[u] == false {
            used[u] = true
            go Crawl(u, depth-1, fetcher)
        }
        if <-done == true {
            break
        }
    }
    return
}

func main() {
    used["http://golang.org/"] = true
    Crawl("http://golang.org/", 4, fetcher)
}
The problem is that when I run the program the crawler stops after printing
not found: http://golang.org/cmd/
This only happens when I try to make the program run in parallel. If I run it sequentially, all the URLs are found correctly.
Note: If I am not doing this right (parallelism I mean) then I apologise.
Once the main() func returns, all other goroutines are killed immediately. Crawl() looks recursive, but it is not: each nested call runs in its own goroutine, so the outer call returns immediately instead of waiting for the other Crawl() routines. And once the first Crawl(), called by main(), returns, main() regards its mission fulfilled.

What you need is to make main() wait until the last Crawl() returns. The sync package, or a chan, would help.
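For instance, here is a minimal sketch of the sync.WaitGroup approach. It is not the only way to do it; the fakeFetcher type, its made-up URL data, and the seen/mu/wg names are my own illustrations, with fakeFetcher standing in for the tour's test data. The mutex is there because Go maps are not safe for concurrent writes:

package main

import (
    "fmt"
    "sync"
)

// Fetcher matches the interface from the tour exercise.
type Fetcher interface {
    Fetch(url string) (body string, urls []string, err error)
}

// fakeFetcher is a trimmed stand-in for the tour's test data.
type fakeFetcher map[string][]string

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if urls, ok := f[url]; ok {
        return "page at " + url, urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}

var fetcher = fakeFetcher{
    "http://golang.org/":     {"http://golang.org/pkg/", "http://golang.org/cmd/"},
    "http://golang.org/pkg/": {"http://golang.org/", "http://golang.org/pkg/fmt/"},
}

var (
    mu   sync.Mutex // a map is not safe for concurrent writes
    seen = map[string]bool{}
    wg   sync.WaitGroup // lets main wait for every Crawl goroutine
)

// Crawl fetches url, then crawls each unseen link in its own goroutine.
func Crawl(url string, depth int, fetcher Fetcher) {
    defer wg.Done()
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        mu.Lock()
        first := !seen[u]
        seen[u] = true
        mu.Unlock()
        if first {
            wg.Add(1) // Add before go, so Wait can't fire early
            go Crawl(u, depth-1, fetcher)
        }
    }
}

func main() {
    seen["http://golang.org/"] = true
    wg.Add(1)
    go Crawl("http://golang.org/", 4, fetcher)
    wg.Wait() // blocks until the last Crawl has called Done
}

The key is that wg.Wait() keeps main() alive until every goroutine has called Done, so the program cannot exit while crawls are still in flight.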
You could also take a look at the solution I wrote for this months ago:
var store map[string]bool

// Krawl fetches a single page and sends its links to Urls
// (nil on a fetch error, so the receiver's count still balances).
func Krawl(url string, fetcher Fetcher, Urls chan []string) {
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
    } else {
        fmt.Printf("found: %s %q\n", url, body)
    }
    Urls <- urls
}

func Crawl(url string, depth int, fetcher Fetcher) {
    Urls := make(chan []string)
    go Krawl(url, fetcher, Urls)
    band := 1         // number of Krawl goroutines still in flight
    store[url] = true // init for level 0 done
    for i := 0; i < depth; i++ {
        for band > 0 {
            band--
            next := <-Urls
            for _, url := range next {
                if _, done := store[url]; !done {
                    store[url] = true
                    band++
                    go Krawl(url, fetcher, Urls)
                }
            }
        }
    }
    return
}

func main() {
    store = make(map[string]bool)
    Crawl("http://golang.org/", 4, fetcher)
}
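Here band plays the role of a hand-rolled sync.WaitGroup counter: it goes up for every Krawl goroutine started and down for every slice received on Urls, so the receiving loop, and therefore Crawl() and main(), cannot finish until every goroutine has reported back. Only Crawl() ever touches the store map, so no mutex is needed.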