I've been working through all of the Tour of Go exercises and I'm stuck on the web crawler. I thought I had finished it, but the output is inconsistent and I don't have enough concurrency experience to figure out why.
Here's my code:
package main

import (
    "fmt"
    "sync"
)

type Fetcher interface {
    // Fetch returns the body of URL and
    // a slice of URLs found on that page.
    Fetch(url string) (body string, urls []string, err error)
}

var cache = struct {
    fetched map[string]bool
    sync.Mutex
}{fetched: make(map[string]bool)}
// Crawl uses fetcher to recursively crawl
// pages starting with url, to a maximum of depth.
func Crawl(url string, depth int, fetcher Fetcher, c chan []string, quit chan int) {
    if depth <= 0 {
        return
    }
    go safeVisit(url, c, quit, fetcher)
    for {
        select {
        case <-quit:
            return
        case u := <-c:
            for _, v := range u {
                go Crawl(v, depth-1, fetcher, c, quit)
            }
        }
    }
}
func main() {
    c := make(chan []string)
    quit := make(chan int)
    Crawl("http://golang.org/", 4, fetcher, c, quit)
}
func safeVisit(url string, c chan []string, quit chan int, fetcher Fetcher) {
    cache.Lock()
    defer cache.Unlock()
    if _, ok := cache.fetched[url]; ok {
        quit <- 0
        return
    }
    body, urls, err := fetcher.Fetch(url)
    cache.fetched[url] = true
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("Visited : %s, %q\n", url, body)
    c <- urls
}
// fakeFetcher is a Fetcher that returns canned results.
type fakeFetcher map[string]*fakeResult

type fakeResult struct {
    body string
    urls []string
}

func (f fakeFetcher) Fetch(url string) (string, []string, error) {
    if res, ok := f[url]; ok {
        return res.body, res.urls, nil
    }
    return "", nil, fmt.Errorf("not found: %s", url)
}
// fetcher is a populated fakeFetcher.
var fetcher = fakeFetcher{
    "http://golang.org/": &fakeResult{
        "The Go Programming Language",
        []string{
            "http://golang.org/pkg/",
            "http://golang.org/cmd/",
        },
    },
    "http://golang.org/pkg/": &fakeResult{
        "Packages",
        []string{
            "http://golang.org/",
            "http://golang.org/cmd/",
            "http://golang.org/pkg/fmt/",
            "http://golang.org/pkg/os/",
        },
    },
    "http://golang.org/pkg/fmt/": &fakeResult{
        "Package fmt",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
    "http://golang.org/pkg/os/": &fakeResult{
        "Package os",
        []string{
            "http://golang.org/",
            "http://golang.org/pkg/",
        },
    },
}
Here's some sample output
Visited : http://golang.org/, "The Go Programming Language"
not found: http://golang.org/cmd/
Visited : http://golang.org/pkg/, "Packages"
Visited : http://golang.org/pkg/os/, "Package os"
**Visited : http://golang.org/pkg/fmt/, "Package fmt"**
Process finished with exit code 0
In a different run, the last package (deliberately marked with asterisks above) is missing:
Visited : http://golang.org/, "The Go Programming Language"
not found: http://golang.org/cmd/
Visited : http://golang.org/pkg/, "Packages"
Visited : http://golang.org/pkg/os/, "Package os"
And finally, even a deadlock in some runs:
Visited : http://golang.org/, "The Go Programming Language"
not found: http://golang.org/cmd/
Visited : http://golang.org/pkg/, "Packages"
Visited : http://golang.org/pkg/os/, "Package os"
Visited : http://golang.org/pkg/fmt/, "Package fmt"
fatal error: all goroutines are asleep - deadlock!
goroutine 1 [select]:
main.Crawl(0x4bfdf9, 0x12, 0x4, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
/home/kostas/development/challenges/go/helloWorld.go:26 +0x201
main.main()
/home/kostas/development/challenges/go/helloWorld.go:39 +0xab
goroutine 23 [select]:
main.Crawl(0x4bfdf9, 0x12, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
/home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
/home/kostas/development/challenges/go/helloWorld.go:31 +0x123
goroutine 24 [select]:
main.Crawl(0x4c09f9, 0x16, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
/home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
/home/kostas/development/challenges/go/helloWorld.go:31 +0x123
goroutine 5 [select]:
main.Crawl(0x4bfdf9, 0x12, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
/home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
/home/kostas/development/challenges/go/helloWorld.go:31 +0x123
goroutine 6 [select]:
main.Crawl(0x4c0a0f, 0x16, 0x3, 0x524220, 0xc420088120, 0xc420092000, 0xc420092060)
/home/kostas/development/challenges/go/helloWorld.go:26 +0x201
created by main.Crawl
/home/kostas/development/challenges/go/helloWorld.go:31 +0x123
I assume it has something to do with concurrency and recursion. I've seen other solutions on GitHub that use a WaitGroup and such, but that hasn't been used in the Tour of Go so far, so I'd rather not use it yet.
UPDATE
I figured out what is going on and am working on the issue. Basically, the select statement sometimes gets stuck in an endless loop because the quit and c channels don't always fire in the order I expected. I added a default case that prints "nothing to do", and the program sometimes looped forever and sometimes finished correctly by luck. My exit condition is not right.
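Roughly, the loop I was experimenting with looked like this (the default case was only there for debugging, and it shows why the loop can spin instead of blocking):

for {
    select {
    case <-quit:
        return
    case u := <-c:
        for _, v := range u {
            go Crawl(v, depth-1, fetcher, c, quit)
        }
    default:
        // Neither channel is ready, so this branch runs immediately
        // and the loop spins instead of blocking.
        fmt.Println("nothing to do")
    }
}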
I think the cause is quite clear: your channels are getting mixed up. Multiple goroutines are receiving from the same channel, and Go just randomly picks one.

When you send a zero through quit, you never know which goroutine will quit: the receiver is picked arbitrarily by the Go scheduler. It is entirely possible that a newly spawned Crawl receives from quit before it ever receives from c (even if both channels are ready).

Because of that, the depth bookkeeping becomes a mess, the number of safeVisit calls becomes unstable, and quit ends up carrying a different (random) number of signals. Sometimes that is simply not enough to stop all the goroutines that were spawned, and you get a deadlock.
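Here is a tiny, self-contained sketch (not part of the crawler; all names are made up) of what that means in practice: when several goroutines block on the same channel, each send is delivered to exactly one of them, picked arbitrarily.

package main

import (
    "fmt"
    "time"
)

func main() {
    quit := make(chan int)
    for i := 0; i < 3; i++ {
        go func(id int) {
            <-quit // any one of the three waiting goroutines may receive this
            fmt.Println("goroutine", id, "got the quit signal")
        }(i)
    }
    quit <- 0                          // exactly one (unpredictable) receiver wakes up
    time.Sleep(100 * time.Millisecond) // crude wait so the winner can print
}

Run it a few times and a different goroutine wins each time, which is exactly what happens to your Crawl loops waiting on quit.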
Edit:
First you should understand what your task is. The Crawl function takes in a url, a depth and a fetcher, and it:

1. fetches the page at the url;
2. prints the fetched body;
3. Crawls the queue of URLs generated from the fetched page, with depth-1.

Though the tour asks you to "fetch" URLs in parallel, it is clear that steps 2 and 3 must happen after step 1, meaning it is normal for a single Crawl to wait for its own fetch. That means there is no need for a new goroutine just to call Fetch.

On step 3, however, each new Crawl call has no need to wait for the previous one to finish, so these calls should be made in parallel.
With this analysis, one can arrive at this code:
func Crawl(url string, depth int, fetcher Fetcher) {
    // TODO: Don't fetch the same URL twice.
    // This version fetches in parallel but does not yet skip visited URLs:
    if depth <= 0 {
        return
    }
    body, urls, err := fetcher.Fetch(url)
    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("found: %s %q\n", url, body)
    for _, u := range urls {
        go Crawl(u, depth-1, fetcher)
    }
    return
}
There is one more problem: dealing with an already-visited url. You have done that part well; instead of sending on quit, just make the check a func(string) bool and call it directly: if visited(url) { return } and it is done.
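A minimal sketch of that idea, reusing the cache struct from your question (the lowercase name visited is just my choice):

// visited reports whether url has already been fetched, marking it as
// fetched otherwise; the cache mutex keeps concurrent Crawl calls from
// racing on the map.
func visited(url string) bool {
    cache.Lock()
    defer cache.Unlock()
    if cache.fetched[url] {
        return true
    }
    cache.fetched[url] = true
    return false
}

Then the top of Crawl simply becomes if depth <= 0 || visited(url) { return }, and no quit channel is needed at all.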
A side note: the tour is really not great at teaching concurrency. You may want to look at the Go blog articles, like Go Concurrency Patterns or Share Memory by Communicating.