How can I detect what is preventing Go from using multiple cores?

I have a piece of concurrent code that is meant to run on every CPU/core.

There are two large vectors holding the input and output values:

var (
    input = make([]float64, rowCount)
    output = make([]float64, rowCount)
)

Both are filled, and I want to compute the distance (error) between each input-output pair. Since the pairs are independent, a possible concurrent version is the following:

var d float64 // Error to be computed
// Setup a worker "for each CPU"
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
    go func(id int) {
         var wd float64
         // eg nw = 4
         // worker0, i = 0, 4, 8, 12...
         // worker1, i = 1, 5, 9, 13...
         // worker2, i = 2, 6, 10, 14...
         // worker3, i = 3, 7, 11, 15...
         for i := id; i < rowCount; i += nw {
             res := compute(input[i])
             wd += distance(res, output[i])
         }
         ch <- wd
    }(w)
}
// Compute total distance
for w := 0; w < nw; w++ {
    d += <-ch
}

The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.

The problem I'm having is that this code is no faster than the serial code.

I'm using Go 1.7, so GOMAXPROCS should already default to runtime.NumCPU(), but even setting it explicitly with runtime.GOMAXPROCS does not improve performance.
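
To rule this out, the effective values can be printed at startup. This is only a minimal sketch; the helper name parallelismInfo is made up for illustration and is not part of my program. Calling runtime.GOMAXPROCS with 0 queries the current setting without changing it.

```go
package main

import (
	"fmt"
	"runtime"
)

// parallelismInfo (hypothetical helper) returns the number of logical CPUs
// and the current GOMAXPROCS value. Passing 0 to runtime.GOMAXPROCS only
// reads the setting; it does not modify it.
func parallelismInfo() (cpus, procs int) {
	return runtime.NumCPU(), runtime.GOMAXPROCS(0)
}

func main() {
	cpus, procs := parallelismInfo()
	fmt.Printf("NumCPU=%d GOMAXPROCS=%d\n", cpus, procs)
}
```

If both numbers match the core count, the scheduler is at least allowed to use every core, and the cause has to be somewhere else.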

  • distance is just (a-b)*(a-b);
  • compute is a bit more complex, but should be reentrant and use global data only for reading (and uses math.Pow and math.Sqrt functions);
  • no other goroutine is running.

So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand, for example).

I also compiled with -race and nothing emerged.

My host has 4 virtual cores, but when I run this code htop shows CPU usage of 102%. I expected something around 380%, as I have seen in the past with other Go code that used all the cores.

I would like to investigate, but I don't know how the runtime allocates threads and schedules goroutines.

How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?
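
For context, this is roughly how a CPU profile could be captured with runtime/pprof, as a sketch under assumptions: busyWork is a hypothetical stand-in for the compute/distance loop, and profileCPU is an illustrative helper, not something from my program. Only pprof.StartCPUProfile and pprof.StopCPUProfile are real library calls here.

```go
package main

import (
	"bytes"
	"fmt"
	"math"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for the compute/distance loop.
func busyWork(n int) float64 {
	var sum float64
	for i := 0; i < n; i++ {
		sum += math.Sqrt(math.Pow(float64(i), 1.5))
	}
	return sum
}

// profileCPU (illustrative helper) runs fn while a CPU profile is being
// collected and returns the raw profile bytes.
func profileCPU(fn func()) ([]byte, error) {
	var buf bytes.Buffer
	if err := pprof.StartCPUProfile(&buf); err != nil {
		return nil, err
	}
	fn()
	// StopCPUProfile flushes the collected profile into buf.
	pprof.StopCPUProfile()
	return buf.Bytes(), nil
}

func main() {
	data, err := profileCPU(func() { busyWork(5000000) })
	if err != nil {
		fmt.Println("could not start CPU profile:", err)
		return
	}
	fmt.Printf("collected %d bytes of profile data\n", len(data))
}
```

In a real program the bytes would be written to a file and inspected with go tool pprof, which shows where the CPU time is actually spent across the whole run.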

Thanks in advance

Sorry, in the end it turned out I got the measurement wrong. @JimB was right, and I had a minor leak, but not enough to explain a slowdown of this magnitude.

My expectations were too high: the function I was making concurrent was called only at the beginning of the program, so the overall performance improvement was only minor.

After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section mattered most.

Anyway, I learned a lot of interesting things along the way, so thanks a lot to everyone who tried to help!
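
This is the effect described by Amdahl's law: if only a fraction p of the total run time is parallelized over n workers, the overall speedup is 1 / ((1 - p) + p/n). A small sketch, where the fractions are hypothetical values chosen for illustration and not measurements from my program:

```go
package main

import "fmt"

// amdahlSpeedup returns the overall speedup when a fraction p of the total
// run time (0 <= p <= 1) is spread perfectly across n workers.
func amdahlSpeedup(p float64, n int) float64 {
	return 1 / ((1 - p) + p/float64(n))
}

func main() {
	// Hypothetical fractions of run time spent in the parallelized section.
	for _, p := range []float64{0.05, 0.50, 0.95} {
		fmt.Printf("parallel fraction %.2f on 4 cores: %.2fx\n", p, amdahlSpeedup(p, 4))
	}
}
```

With a small parallel fraction the speedup on 4 cores stays close to 1x, which matches what I observed when only the startup function was concurrent.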