Concurrency: Chudnovsky's algorithm slower than synchronous

I recently started learning Go at the recommendation of a friend. So far, I am loving it, but I wrote (what I thought would be) the perfect example of the power of lightweight concurrency, and got a surprising result... so I suspect I'm doing something wrong, or I'm misunderstanding how expensive goroutines are. I'm hoping some gophers here can provide insight.

I wrote Chudnovsky's algorithm in Go using both goroutines and simple synchronous execution. I assumed that, since each term is independent of the others, it would be at least a little faster running concurrently.
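
For reference, the series being summed is the Chudnovsky series,

\[
\frac{1}{\pi} = 12 \sum_{q=0}^{\infty} \frac{(-1)^q\,(6q)!\,(545140134\,q + 13591409)}{(3q)!\,(q!)^3\,640320^{3q+3/2}},
\]

which is why both versions compute pi as 1/(12*sum) at the end.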

Note: I am running this on a 5th-gen i7, so if goroutines are multiplexed onto OS threads as I was told, this should be both concurrent and parallel.

package main

import (
  "fmt"
  "math"
  "strconv"
  "time"
)

func main() {
  var input string
  var sum float64
  var pi float64
  c := make(chan float64)

  fmt.Print("How many iterations? ")
  fmt.Scanln(&input)
  max, err := strconv.Atoi(input)

  if err != nil {
    panic("You did not enter a valid integer")
  }
  start := time.Now() //start timing execution of concurrent routine

  for i := 0; i < max; i++ {
    go chudnovskyConcurrent(i, c)
  }

  for i := 0; i < max; i++ {
    sum += <-c
  }
  end := time.Now() //end of concurrent routine
  fmt.Println("Duration of concurrent calculation: ",end.Sub(start))
  pi = 1/(12*sum)
  fmt.Println(pi)

  start = time.Now() //start timing execution of synchronous routine
  sum = 0
  for i := 0; i < max; i++ {
    sum += chudnovskySync(i)
  }
  end = time.Now() //end of synchronous routine
  fmt.Println("Duration of synchronous calculation: ",end.Sub(start))
  pi = 1/(12*sum)
  fmt.Println(pi)
}

func chudnovskyConcurrent(i int, c chan<- float64) {
  ifloat := float64(i)
  iun := uint64(i)
  numerator := math.Pow(-1, ifloat) * float64(factorial(6*iun)) * (545140134*ifloat + 13591409)
  denominator := float64(factorial(3*iun)) * math.Pow(float64(factorial(iun)), 3) * math.Pow(math.Pow(640320, 3), ifloat+0.5)
  c <- numerator / denominator
}

func chudnovskySync(i int) float64 {
  ifloat := float64(i)
  iun := uint64(i)
  numerator := math.Pow(-1, ifloat) * float64(factorial(6*iun)) * (545140134*ifloat + 13591409)
  denominator := float64(factorial(3*iun)) * math.Pow(float64(factorial(iun)), 3) * math.Pow(math.Pow(640320, 3), ifloat+0.5)
  return numerator / denominator
}

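// Note: n! overflows uint64 for n > 20, which 6*i reaches already at
// i = 4, so later terms are computed from wrapped garbage; this is
// also what eventually produces the NaN results seen in longer runs
// below. A version aiming for real precision would use math/big.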
func factorial(n uint64) uint64 {
  if n > 0 {
    return n * factorial(n-1)
  }
  return 1
}

And here are my results:

How many iterations? 20
Duration of concurrent calculation:  573.944µs
3.1415926535897936
Duration of synchronous calculation:  63.056µs
3.1415926535897936

The calculations you're doing are too simple to run each one in a separate goroutine. You're losing more time in the runtime (creating goroutines, multiplexing, scheduling, etc.) than on the actual calculations. What you're trying to do is better suited to a GPU, for example, where a massive number of parallel execution units can do all of these simple calculations at once, almost instantly. But you would need other languages and APIs for that.
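
To get a feel for how big that overhead is, here's a minimal sketch that times goroutines doing no work at all beyond sending a value back:

package main

import (
  "fmt"
  "time"
)

func main() {
  const n = 20
  c := make(chan float64)
  start := time.Now()
  // Each goroutine only sends a value, so the elapsed time is almost
  // entirely goroutine-creation, scheduling, and channel overhead.
  for i := 0; i < n; i++ {
    go func() { c <- 0 }()
  }
  for i := 0; i < n; i++ {
    <-c
  }
  fmt.Println("Overhead for", n, "goroutines:", time.Since(start))
}

On a typical machine this alone can land in the same order of magnitude as your entire synchronous run.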

What you can do is create one software thread of execution for every hardware thread of execution. You want to split your max iterations into big chunks and execute them in parallel. Here's a very simple take on it, just to illustrate the idea:

package main

import (
  "fmt"
  "math"
  "runtime"
  "strconv"
  "time"
)

func main() {
  var input string
  var sum float64
  var pi float64
  c := make(chan float64, runtime.GOMAXPROCS(-1))
  fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(-1))
  fmt.Print("How many iterations? ")
  fmt.Scanln(&input)
  max, err := strconv.Atoi(input)

  if err != nil {
    panic("You did not enter a valid integer")
  }
  start := time.Now() //start timing execution of concurrent routine

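  // One goroutine per hardware thread: each accumulates its own
  // partial sum over a contiguous chunk of the series and sends it
  // back once. Note that the integer division max/GOMAXPROCS silently
  // drops the remainder iterations when max is not evenly divisible.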
  for i := 0; i < runtime.GOMAXPROCS(-1); i++ {
    go func(i int) {
      var sum float64
      for j := 0; j < max/runtime.GOMAXPROCS(-1); j++ {
        sum += chudnovskySync(j + i*max/runtime.GOMAXPROCS(-1))
      }
      c <- sum
    }(i)
  }

  for i := 0; i < runtime.GOMAXPROCS(-1); i++ {
    sum += <-c
  }
  end := time.Now() //end of concurrent routine
  fmt.Println("Duration of concurrent calculation: ",end.Sub(start))
  pi = 1/(12*sum)
  fmt.Println(pi)

  start = time.Now() //start timing execution of synchronous routine
  sum = 0
  for i := 0; i < max; i++ {
    sum += chudnovskySync(i)
  }
  end = time.Now() //end of synchronous routine
  fmt.Println("Duration of synchronous calculation: ",end.Sub(start))
  pi = 1/(12*sum)
  fmt.Println(pi)
}

func chudnovskySync(i int) float64 {
  ifloat := float64(i)
  iun := uint64(i)
  numerator := math.Pow(-1, ifloat) * float64(factorial(6*iun)) * (545140134*ifloat + 13591409)
  denominator := float64(factorial(3*iun)) * math.Pow(float64(factorial(iun)), 3) * math.Pow(math.Pow(640320, 3), ifloat+0.5)
  return numerator / denominator
}

func factorial(n uint64) uint64 {
  if n > 0 {
    return n * factorial(n-1)
  }
  return 1
}

And here are the results:

$ go version
go version go1.5.2 windows/amd64

$ go run main.go
GOMAXPROCS = 4
How many iterations? 10000
Duration of concurrent calculation:  932.8916ms
NaN
Duration of synchronous calculation:  2.0639744s
NaN 

I agree: your calculations don't do enough processing to overcome the overhead of having multiple goroutines. Just for fun, I modified your code to do the calculation many times (1000, 10000, 100000, 1000000) before returning the result. I ran this (with 20 iterations) under Mac OS X Yosemite on a quad-core Xeon, and, as you might expect, the synchronous version takes in the neighborhood of four times as long as the parallel version.
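
Roughly, the modification looks like this (chudnovskyBusy and reps are just illustrative names):

func chudnovskyBusy(i, reps int) (r float64) {
  // Recompute the same term reps times purely to inflate the CPU work
  // per call; only the final value is kept.
  for n := 0; n < reps; n++ {
    r = chudnovskySync(i)
  }
  return
}

With reps large enough, each goroutine's compute time dwarfs the scheduling overhead, which is when the roughly 4x speedup on four cores shows up.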

One interesting thing I noticed is that, with a large number of repetitions, the synchronous version actually takes more than four times as long as the parallel version. I'm guessing this has something to do with Intel's hyper-threading, which allows some level of parallelism within each core, but I'm not sure about that.