Crawling the web in Golang without loading everything into memory

I'm trying to rewrite a web crawler in Go (originally written in Python with gevent), but I've hit a wall: no matter what I do, memory consumption climbs rapidly. For example, this simple program:

package main

import (
     "bufio"
     "fmt"
     "os"
     "net/http"
     "io"
     "strings"
     "time"
)

func readLine(in *bufio.Reader, domains chan<- string) {
    for conc := 0; conc < 500; conc++ {
        input, err := in.ReadString('\n')
        if err == io.EOF {
            break
        }
        if err != nil {
            fmt.Fprintf(os.Stderr, "read(stdin): %s
", err)
            os.Exit(1)
        }

        input = strings.TrimSpace(input)
        if input == "" {
            continue
        }

        domain := input
        domains <- domain
    }
}

func get(domains <-chan string) {
   url := <-domains
   URLresp, err := http.Get(url)
   if err != nil {
       fmt.Println(err)
   }
   if err == nil {
       fmt.Println(url," OK")
       URLresp.Body.Close()
   }
}

func main() {
    domains := make(chan string, 500)

    inFile, _ := os.Open("F:\\PATH\\TO\\LIST_OF_URLS_SEPARATED_BY_NEWLINE.txt")
    in := bufio.NewReader(inFile)

    for {
        go readLine(in, domains)
        for i := 0; i < 500; i++ { go get(domains) }
        time.Sleep(100000000)
    }
}

I've tried pprof, but it seems to say I'm only using about 50 MB of heap, while the process's memory consumption in the OS resource monitor keeps skyrocketing.
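A quick way to check whether goroutines (rather than heap) are accumulating is to watch the goroutine count; a minimal debugging sketch (it assumes "runtime" is added to the imports):

// Debugging sketch: periodically report the number of live goroutines.
go func() {
    for {
        fmt.Println("goroutines:", runtime.NumGoroutine())
        time.Sleep(5 * time.Second)
    }
}()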

I also tried creating a custom http.Transport with keep-alives disabled, since I found out net/http keeps connections around for reuse, but no luck with that either.

Let's consider what's wrong with your code, focusing on your main() function.

func main() {
    domains := make(chan string, 500)

This is good. You create a buffered channel to handle the domain list input. No problem.

    inFile, _ := os.Open("F:\\PATH\\TO\\LIST_OF_URLS_SEPARATED_BY_NEWLINE.txt")

You open the input file. You should never ignore errors, but we'll ignore that for now.
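For completeness, a minimal sketch of handling that error instead of discarding it:

inFile, err := os.Open("F:\\PATH\\TO\\LIST_OF_URLS_SEPARATED_BY_NEWLINE.txt")
if err != nil {
    fmt.Fprintf(os.Stderr, "open: %s\n", err)
    os.Exit(1)
}
defer inFile.Close()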

    in := bufio.NewReader(inFile)

    for {

Here you start an infinite loop. Why?

        go readLine(in, domains)

Here you read up to the next 500 lines from in, sending them to the domains channel, but you do it in the background, which means the next line executes before readLine has a chance to finish.

        for i := 0; i < 500; i++ { go get(domains) }

Here you call get(domains) 500 times, in parallel. But as explained above, you do this before readLine has completed, so (at least on the first pass through the outer loop) most of those 500 goroutines simply block on the receive from the still-empty domains channel. Each get() also handles only a single URL and then returns instead of looping, but I'll leave that for you to consider; one possible shape is sketched below.
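One possible shape for a worker, sketched here on the assumption that whoever feeds domains eventually closes it: range over the channel so the goroutine exits cleanly once the input is exhausted, and always close the response body.

// Sketch: a worker that keeps pulling URLs until the channel is closed.
func get(domains <-chan string) {
    for url := range domains {
        URLresp, err := http.Get(url)
        if err != nil {
            fmt.Println(err)
            continue
        }
        fmt.Println(url, " OK")
        URLresp.Body.Close()
    }
}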

        time.Sleep(100000000)

Then you sleep for 0.1 seconds (100,000,000 nanoseconds; time.Sleep(100 * time.Millisecond) would say the same thing more readably) before the next iteration of the infinite loop.

    }
}

The infinite loop will then attempt, again, to read the next 500 items from your file, again, in the background. If the first call to readLine takes more than 0.1 seconds to complete, you'll have two copies of readLine simultaneously reading from the same bufio.Reader, which is a data race and will likely corrupt your input or panic.

Even supposing this all behaved as you expect (it most certainly and obviously does not), after reading in all the URLs in the file, the program will continue, forever, to spawn an additional 501 goroutines every 0.1 seconds. One goroutine attempts to read more lines from the file, finds that there are none, and exits immediately. The other 500 goroutines end up waiting, forever, to receive a nonexistent value from the domains channel. This is your memory "leak".
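Putting those pieces together, here is a sketch of one way to restructure main(). Assumptions: a single producer goroutine owns the file and closes the channel when the input runs out, a fixed pool of workers (the range-based get() sketched above) drains it, and a sync.WaitGroup replaces the sleep loop. It needs "sync" added to the imports.

func main() {
    domains := make(chan string, 500)

    inFile, err := os.Open("F:\\PATH\\TO\\LIST_OF_URLS_SEPARATED_BY_NEWLINE.txt")
    if err != nil {
        fmt.Fprintf(os.Stderr, "open: %s\n", err)
        os.Exit(1)
    }
    defer inFile.Close()

    // Single producer: read every line, send it, then close the channel
    // so the workers know there is nothing more to wait for.
    go func() {
        scanner := bufio.NewScanner(inFile)
        for scanner.Scan() {
            if domain := strings.TrimSpace(scanner.Text()); domain != "" {
                domains <- domain
            }
        }
        // (scanner.Err() check omitted for brevity)
        close(domains)
    }()

    // Fixed pool of workers; each exits when the channel is drained and closed.
    var wg sync.WaitGroup
    for i := 0; i < 500; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            get(domains)
        }()
    }
    wg.Wait()
}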

The problem was the lack of a default timeout on the dial step used by net/http: requests that never completed hogged resources by never letting their goroutines die. The following works (it needs the net and time packages imported):

c := &http.Client{
 Transport: &http.Transport{
     DisableKeepAlives: true,
     Dial: (&net.Dialer{
             Timeout:   30 * time.Second,
             KeepAlive: 30 * time.Second,
     }).Dial,
     TLSHandshakeTimeout:   10 * time.Second,
     ResponseHeaderTimeout: 10 * time.Second,
     ExpectContinueTimeout: 1 * time.Second,}}

URLresp, err := c.Get(url)
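As a simpler alternative on current Go, http.Client also has a Timeout field that caps the entire request, including reading the body; a minimal sketch, assuming a flat 30-second limit per request is acceptable:

// Sketch: one overall deadline covering dial, TLS, headers, and body.
c := &http.Client{Timeout: 30 * time.Second}

URLresp, err := c.Get(url)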