I'm using Go with the gotk3 and webkit2 libraries to try and build a web crawler that can parse JavaScript in the context of a WebKitWebView.
Thinking of performance, I'm trying to figure out what would be the best way to have it crawl concurrently (if not in parallel, with multiple processors), using all available resources.
GTK and everything with threads and goroutines are pretty new to me. Reading from the gotk3 goroutines example, it states:
Native GTK is not thread safe, and thus, gotk3's GTK bindings may not be used from other goroutines. Instead, glib.IdleAdd() must be used to add a function to run in the GTK main loop when it is in an idle state.
Go will panic and show a stack trace when I try to run a function, which creates a new WebView, in a goroutine. I'm not exactly sure why this happens, but I think it has something to do with this comment. An example is shown below.
Here's my current code, which has been adapted from the webkit2 example:
package main
import (
"fmt"
"github.com/gotk3/gotk3/glib"
"github.com/gotk3/gotk3/gtk"
"github.com/sourcegraph/go-webkit2/webkit2"
"github.com/sqs/gojs"
)
func crawlPage(url string) {
web := webkit2.NewWebView()
web.Connect("load-changed", func(_ *glib.Object, i int) {
loadEvent := webkit2.LoadEvent(i)
switch loadEvent {
case webkit2.LoadFinished:
fmt.Printf("Load finished for: %v
", url)
web.RunJavaScript("window.location.hostname", func(val *gojs.Value, err error) {
if err != nil {
fmt.Println("JavaScript error.")
} else {
fmt.Printf("Hostname (from JavaScript): %q
", val)
}
//gtk.MainQuit()
})
}
})
glib.IdleAdd(func() bool {
web.LoadURI(url)
return false
})
}
func main() {
gtk.Init(nil)
crawlPage("https://www.google.com")
crawlPage("https://www.yahoo.com")
crawlPage("https://github.com")
crawlPage("http://deelay.me/2000/http://deelay.me/img/1000ms.gif")
gtk.Main()
}
It seems that creating a new WebView for each URL allows them to load concurrently. Having glib.IdleAdd()
running in a goroutine, as per the gotk3 example, doesn't seem to have any effect (although I'm only doing a visual benchmark):
go glib.IdleAdd(func() bool { // Works
web.LoadURI(url)
return false
})
However, trying to create a goroutine for each crawlPage()
call ends in a panic:
go crawlPage("https://www.google.com") // Panics and shows stack trace
I can run web.RunJavaScript()
in a goroutine without issue:
switch loadEvent {
case webkit2.LoadFinished:
fmt.Printf("Load finished for: %v
", url)
go web.RunJavaScript("window.location.hostname", func(val *gojs.Value, err error) { // Works
if err != nil {
fmt.Println("JavaScript error.")
} else {
fmt.Printf("Hostname (from JavaScript): %q
", val)
}
//gtk.MainQuit()
})
}
The current methods I can think of are:
What would actually be the best way to run this code concurrently, if possible, and max out performance?
Method 1 and 2 are probably out of the picture, as I ran a test by spawning ~100 WebViews and they seem to load synchronously.