High-performance web spider without external dependencies

I'm trying to write my first web spider in Go. Its task is to crawl domains (and inspect their HTML) from a provided database query. The idea is to have no third-party dependencies (e.g. a message queue), or as few as possible, yet it has to be performant enough to crawl 5 million domains per day. I have approximately 150 million domains I need to check every month.

A very basic version is below. It runs in an "infinite loop", as in theory the crawl process would be endless.

package main

import (
    "runtime"
    "sync"
    "time"
)

func crawl(interval time.Duration) {
    var wg sync.WaitGroup
    runtime.GOMAXPROCS(runtime.NumCPU()) // redundant since Go 1.5; this is the default

    // Spawn one worker goroutine per tick.
    for range time.Tick(interval) {
        wg.Add(1)

        go func() {
            defer wg.Done()

            // do the expensive work here - query db, crawl domain, inspect html
        }()
    }
    wg.Wait() // never reached: time.Tick's channel is never closed
}

func main() {
    go crawl(time.Second)

    select {}
}

Running this code on 4 CPU cores means it can perform at most 345,600 requests in 24 hours ((60 * 60 * 24) * 4) with the given 1s tick. At least that's my understanding :-) If my thinking is correct, then I need to come up with a solution that is about 14x faster (5,000,000 / 345,600 ≈ 14.5) to meet the daily requirement.

I would appreciate your advice on making the crawler faster, without resorting to a complicated stack setup or buying a server with more CPU cores.

Why have the timing component at all?

Just create a channel that you feed URLs to, then spawn N goroutines that loop over that channel and do the work.

Then just tweak the value of N until your CPU/memory is capped at ~90% utilization (to accommodate fluctuations in site response times).

Something like this (on Play):

package main

import (
    "fmt"
    "sync"
)

const numWorkers = 10

// crawler consumes URLs from the channel until it is closed.
func crawler(urls <-chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        fmt.Println(u) // do the expensive work here instead
    }
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup

    // Spawn a fixed pool of workers, all reading from the same channel.
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }

    // Feed URLs to the pool; sends block until a worker is free.
    ch <- "http://ibm.com"
    ch <- "http://google.com"

    close(ch) // signal the workers that no more work is coming
    wg.Wait()
    fmt.Println("All Done")
}
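
In your case each worker would fetch and inspect the page instead of printing. Here's a minimal sketch of what that might look like, with a client timeout so one slow site can't stall a worker indefinitely. The numWorkers value, the 10-second timeout, and the inspectHTML hook are placeholders; swap in your own DB-driven feed and parsing logic:

package main

import (
    "fmt"
    "io"
    "net/http"
    "sync"
    "time"
)

const numWorkers = 50 // tune until CPU/memory sits around 90%

// A shared client with a timeout so a slow domain can't hold a worker forever.
var client = &http.Client{Timeout: 10 * time.Second}

func crawler(urls <-chan string, wg *sync.WaitGroup) {
    defer wg.Done()
    for u := range urls {
        resp, err := client.Get(u)
        if err != nil {
            fmt.Println("fetch error:", u, err)
            continue
        }
        body, err := io.ReadAll(resp.Body)
        resp.Body.Close()
        if err != nil {
            fmt.Println("read error:", u, err)
            continue
        }
        inspectHTML(u, body)
    }
}

// inspectHTML is a placeholder for whatever analysis you need to run.
func inspectHTML(u string, body []byte) {
    fmt.Println(u, len(body), "bytes")
}

func main() {
    ch := make(chan string)
    var wg sync.WaitGroup

    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go crawler(ch, &wg)
    }

    // In the real program this loop would be fed by your database query.
    for _, u := range []string{"http://ibm.com", "http://google.com"} {
        ch <- u
    }

    close(ch)
    wg.Wait()
}

Note that 5 million domains per day only works out to about 58 requests per second (5,000,000 / 86,400), and since the workers are I/O-bound rather than CPU-bound, you can usually run N far above the core count without adding hardware.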