I am trying to scrape related words for a given word for which I am using BFS starting with the given word and searching through each related word on dictionary.com
I have tried this code without concurrency and it works just fine, but takes a lot of time hence, tried using go routines but my code gets stuck after the first iteration. The first level of BFS works just fine but then in the second level it hangs!
package main
import (
"fmt"
"github.com/gocolly/colly"
"sync"
)
var wg sync.WaitGroup
func buildURL(word string) string {
return "https://www.dictionary.com/browse/" + string(word)
}
func get(url string) []string {
c := colly.NewCollector()
c.IgnoreRobotsTxt = true
var ret []string
c.OnHTML("a.css-cilpq1.e15p0a5t2", func(e *colly.HTMLElement) {
ret = append(ret, string(e.Text))
})
c.Visit(url)
c.Wait()
return ret
}
func threading(c chan []string, word string) {
defer wg.Done()
var words []string
for _, w := range get(buildURL(word)) {
words = append(words, w)
}
c <- words
}
func main() {
fmt.Println("START")
word := "jump"
maxDepth := 2
//bfs
var q map[string]int
nq := map[string]int {
word: 0,
}
vis := make(map[string]bool)
queue := make(chan []string, 5000)
for i := 1; i <= maxDepth; i++ {
fmt.Println(i)
q, nq = nq, make(map[string]int)
for word := range q {
if _, ok := vis[word]; !ok {
wg.Add(1)
vis[word] = true
go threading(queue, word)
for v := range queue {
fmt.Println(v)
for _, w := range v {
nq[w] = i
}
}
}
}
}
wg.Wait()
close(queue)
fmt.Println("END")
}
OUTPUT:
START
1
[plunge dive rise upsurge bounce hurdle fall vault drop advance upturn inflation increment spurt boost plummet skip bound surge take]
hangs just here forever, counter = 2 is not printed!
Can check here https://www.dictionary.com/browse/jump for the related words.
According to Tour of Go
Sends to a buffered channel block only when the buffer is full. Receives block when the buffer is empty.
So, in this case, you are creating a buffered channel using 5000 as length.
for i := 1; i <= maxDepth; i++ {
fmt.Println(i)
q, nq = nq, make(map[string]int)
for word := range q { // for each word
if _, ok := vis[word]; !ok { // if not visited visit
wg.Add(1) // add a worker
vis[word] = true
go threading(queue, word) // fetch in concurrent manner
for v := range queue { // <<< blocks here when queue is empty
fmt.Println(v)
for _, w := range v {
nq[w] = i
}
}
}
}
}
As you can see I've commented in the code, after 1st iteration the for loop gonna block until channel is empty. In this case after fetching jump
It sends the array corresponding similar words, but after that as the for loop is blocking as zerkems explains you will not get to next iteration(i = 2). You can ultimately close the channel to end the blocking in for loop. But since you use the same channel to write over multiple goroutines it will panic if you closed it from multiple goroutines.
To overcome this we can come up with a nice workaround.
First, we need to count the unvisited words and then we can iterate that much of the time
vis := make(map[string]bool)
queue := make(chan []string, 5000)
for i := 1; i <= maxDepth; i++ {
fmt.Println(i)
q, nq = nq, make(map[string]int)
unvisited := 0
for word := range q {
if _, ok := vis[word]; !ok {
vis[word] = true
unvisited++
wg.Add(1)
go threading(queue, word)
}
}
wg.Wait() // wait until jobs are done
for j := 0; j < unvisited; j++ { // << does not block as we know how much
v := <-queue // we exactly try to get unvisited amount
fmt.Println(v)
for _, w := range v {
nq[w] = i
}
}
}
In this situation, we are simply counting what is the minimum iterations we need to go to get results. Also, you can see that I've moved down the for loop outer and use original one to just add words to workers. It will ask to fetch all words and will wait in the following loop to complete there tasks in a non-blocking way.
Latter loop waits until all workers are done. After that next iteration, works and next level of BFS can be reached.
Summary
Hope this helps.