I am currently trying to implement a tree-based data structure in Go and I am seeing disappointing results in my benchmarking. Because I want to be generic about the values I accept, I am limited to using interface{}.
The code in question is an immutable vector trie. Essentially, any time a value in the vector is modified, I need to make a copy of several nodes in the trie. Each of these nodes is implemented as a slice of constant (known at compile time) length. For example, writing a value into a large trie requires copying 5 separate 32-element slices. They must be copies to preserve the immutability of the previous contents.
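To make the copying concrete, the update step looks roughly like this (a simplified sketch, not my actual code; the node layout and the branching factor of 32 just mirror the description above):

const branchFactor = 32

// node is a trie node: children always has length branchFactor and holds
// either *node (interior nodes) or user values (leaves), hence interface{}.
type node struct {
    children []interface{}
}

// withChild returns a new node identical to n except for slot i.
// n itself is never mutated, so older versions of the trie remain valid.
func (n *node) withChild(i int, child interface{}) *node {
    cp := make([]interface{}, branchFactor)
    copy(cp, n.children)
    cp[i] = child
    return &node{children: cp}
}

Writing one value into a deep trie repeats this copy once per level on the path from the root down to the leaf.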
I believe the disappointing benchmark results are because I am storing my data as interface{} in slices, which get created, copied, and appended to often. To measure this I set up the following benchmark:
package main

import (
    "math/rand"
    "testing"
)

func BenchmarkMake10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]int, 10e6, 10e6)
    }
}

func BenchmarkMakePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]*int, 10e6, 10e6)
    }
}

func BenchmarkMakeInterface10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]interface{}, 10e6, 10e6)
    }
}

func BenchmarkMakeInterfacePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]interface{}, 10e6, 10e6)
    }
}

func BenchmarkAppend10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]int, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, jj)
        }
    }
}

func BenchmarkAppendPtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]*int, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, &jj)
        }
    }
}

func BenchmarkAppendInterface10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]interface{}, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, jj)
        }
    }
}

func BenchmarkAppendInterfacePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]interface{}, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, &jj)
        }
    }
}

func BenchmarkSet(b *testing.B) {
    slc := make([]int, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc[rand.Intn(10e6-1)] = 1
    }
}

func BenchmarkSetPtr(b *testing.B) {
    slc := make([]*int, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        theInt := 1
        slc[rand.Intn(10e6-1)] = &theInt
    }
}

func BenchmarkSetInterface(b *testing.B) {
    slc := make([]interface{}, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc[rand.Intn(10e6-1)] = 1
    }
}

func BenchmarkSetInterfacePtr(b *testing.B) {
    slc := make([]interface{}, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        theInt := 1
        slc[rand.Intn(10e6-1)] = &theInt
    }
}
which gives the following results:
BenchmarkMake10M-4                        300     4962381 ns/op
BenchmarkMakePtr10M-4                     100    10255522 ns/op
BenchmarkMakeInterface10M-4               100    19788588 ns/op
BenchmarkMakeInterfacePtr10M-4            100    19850682 ns/op
BenchmarkAppend10M-4                       20    67090711 ns/op
BenchmarkAppendPtr10M-4                     1  2784300818 ns/op
BenchmarkAppendInterface10M-4               1  3457503833 ns/op
BenchmarkAppendInterfacePtr10M-4            1  3532502711 ns/op
BenchmarkSet-4                       30000000        43.5 ns/op
BenchmarkSetPtr-4                    20000000        91.2 ns/op
BenchmarkSetInterface-4              30000000        43.5 ns/op
BenchmarkSetInterfacePtr-4           20000000        70.9 ns/op
The difference on Set and Make seems to be about 2-4x, but the difference on Append is about 40x.
From what I understand, the performance hit comes from the fact that, behind the scenes, interfaces are implemented as pointers, and the boxed values must be allocated on the heap. That still doesn't explain why Append is hit so much harder than Set or Make.
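As a quick sanity check of that hypothesis, separate from the benchmark file above, testing.AllocsPerRun shows the boxing allocation directly (this is just an illustration, not part of my library):

package main

import (
    "fmt"
    "testing"
)

func main() {
    n := 12345
    ints := make([]int, 0, 1)
    ifaces := make([]interface{}, 0, 1)

    intAllocs := testing.AllocsPerRun(1000, func() {
        ints = append(ints[:0], n) // value copied into the slice: no allocation
    })
    ifaceAllocs := testing.AllocsPerRun(1000, func() {
        ifaces = append(ifaces[:0], n) // value boxed behind an interface: one allocation
    })
    fmt.Printf("[]int: %.0f allocs/op, []interface{}: %.0f allocs/op\n", intAllocs, ifaceAllocs)
}

I would expect this to report 0 allocations per operation for the []int case and 1 for the []interface{} case.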
Is there a way, in the current version of Go and without using a code generation tool (e.g., a generics tool that lets the consumer of the library generate a version of the library that stores FooType), to work around this 40x performance hit? Alternatively, have I made some error in my benchmarking?
Let's run the benchmarks with a CPU profile and memory allocation statistics:
go test -bench . -cpuprofile cpu.prof -benchmem
goos: linux
goarch: amd64
BenchmarkMake10M-8                        100    10254248 ns/op    80003282 B/op         1 allocs/op
BenchmarkMakePtr10M-8                     100    18696295 ns/op    80003134 B/op         1 allocs/op
BenchmarkMakeInterface10M-8                50    34501361 ns/op   160006147 B/op         1 allocs/op
BenchmarkMakeInterfacePtr10M-8             50    35129085 ns/op   160006652 B/op         1 allocs/op
BenchmarkAppend10M-8                       20    69971722 ns/op   423503264 B/op        50 allocs/op
BenchmarkAppendPtr10M-8                     1  2135090501 ns/op   423531096 B/op        62 allocs/op
BenchmarkAppendInterface10M-8               1  1833396620 ns/op   907567984 B/op  10000060 allocs/op
BenchmarkAppendInterfacePtr10M-8            1  2270970241 ns/op   827546240 B/op        53 allocs/op
BenchmarkSet-8                       30000000        54.0 ns/op           0 B/op         0 allocs/op
BenchmarkSetPtr-8                    20000000        91.6 ns/op           8 B/op         1 allocs/op
BenchmarkSetInterface-8              30000000        58.0 ns/op           0 B/op         0 allocs/op
BenchmarkSetInterfacePtr-8           20000000        88.0 ns/op           8 B/op         1 allocs/op
PASS
ok      _/home/grzesiek/test    22.427s
We can see that the slowest benchmarks are the ones that make allocations.
PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
Total: 29.75s
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppend10M
     210ms      1.51s (flat, cum)  5.08% of Total
         .      1.30s     4e827a: CALL runtime.growslice(SB)    ;_/home/grzesiek/test.BenchmarkAppend10M test_test.go:35
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
      20ms      930ms (flat, cum)  3.13% of Total
         .      630ms     4e8519: CALL runtime.growslice(SB)    ;_/home/grzesiek/test.BenchmarkAppendInterface10M test_test.go:53
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterfacePtr10M
         0      800ms (flat, cum)  2.69% of Total
         .      770ms     4e8625: CALL runtime.growslice(SB)    ;_/home/grzesiek/test.BenchmarkAppendInterfacePtr10M test_test.go:62
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendPtr10M
         0      950ms (flat, cum)  3.19% of Total
         .      870ms     4e8374: CALL runtime.growslice(SB)    ;_/home/grzesiek/test.BenchmarkAppendPtr10M test_test.go:44
By analyzing the number of bytes allocated, we can see that the use of interface{} doubles the allocation size.
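That doubling is just the interface header: on amd64 an interface{} value is two words (a type pointer plus a data pointer), so the backing array for 10M elements needs 160 MB instead of 80 MB. A quick standalone check (not part of the benchmark file):

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var i int
    var p *int
    var e interface{}
    // On amd64 this prints: int=8 *int=8 interface{}=16
    fmt.Printf("int=%d *int=%d interface{}=%d\n", unsafe.Sizeof(i), unsafe.Sizeof(p), unsafe.Sizeof(e))
}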
Why is BenchmarkAppend10M so much faster than the other BenchmarkAppend* benchmarks? To figure this out, we need to look at the escape analysis.
go test -gcflags '-m -l' original_test.go
./original_test.go:31:28: BenchmarkAppend10M b does not escape
./original_test.go:33:14: BenchmarkAppend10M make([]int, 0, 0) does not escape
./original_test.go:40:31: BenchmarkAppendPtr10M b does not escape
./original_test.go:42:14: BenchmarkAppendPtr10M make([]*int, 0, 0) does not escape
./original_test.go:43:7: moved to heap: jj
./original_test.go:44:22: &jj escapes to heap
./original_test.go:49:37: BenchmarkAppendInterface10M b does not escape
./original_test.go:51:14: BenchmarkAppendInterface10M make([]interface {}, 0, 0) does not escape
./original_test.go:53:16: jj escapes to heap
./original_test.go:58:40: BenchmarkAppendInterfacePtr10M b does not escape
./original_test.go:60:14: BenchmarkAppendInterfacePtr10M make([]interface {}, 0, 0) does not escape
./original_test.go:61:7: moved to heap: jj
./original_test.go:62:16: &jj escapes to heap
./original_test.go:62:22: &jj escapes to heap
We can see that BenchmarkAppend10M is the only benchmark in which jj does not escape to the heap. We can deduce that accessing a heap-allocated variable causes the slowdown.
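To see that difference in isolation, here is a minimal pair (illustrative code, not the benchmark itself); building it with go build -gcflags='-m -l' reports "moved to heap: jj" only for the pointer version:

package main

func appendValues(dst []int) []int {
    for jj := 0; jj < 3; jj++ {
        dst = append(dst, jj) // jj is copied by value and can stay on the stack
    }
    return dst
}

func appendPointers(dst []*int) []*int {
    for jj := 0; jj < 3; jj++ {
        dst = append(dst, &jj) // &jj outlives the iteration, so jj is moved to the heap
    }
    return dst
}

func main() {
    _ = appendValues(nil)
    _ = appendPointers(nil)
}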
Why does BenchmarkAppendInterface10M make so many allocations? In the assembly we can see that it is the only one that calls the runtime.convT2E64 function.
PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
      30ms      1.10s (flat, cum)  3.35% of Total
         .      260ms     4e8490: CALL runtime.convT2E64(SB)
The source code from runtime/iface.go looks like this:
func convT2E64(t *_type, elem unsafe.Pointer) (e eface) {
    if raceenabled {
        raceReadObjectPC(t, elem, getcallerpc(), funcPC(convT2E64))
    }
    if msanenabled {
        msanread(elem, t.size)
    }
    var x unsafe.Pointer
    if *(*uint64)(elem) == 0 {
        x = unsafe.Pointer(&zeroVal[0])
    } else {
        x = mallocgc(8, t, false)
        *(*uint64)(x) = *(*uint64)(elem)
    }
    e._type = t
    e.data = x
    return
}
As we can see, it makes the allocation by calling the mallocgc function.
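Note the zero check in that function: boxing a zero value reuses the shared zeroVal and skips mallocgc entirely. A small benchmark sketch (hypothetical names, not part of the question's file) makes that fast path visible:

package main

import "testing"

func BenchmarkBoxZero(b *testing.B) {
    zero := 0
    slc := make([]interface{}, 0, b.N)
    b.ReportAllocs()
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc = append(slc, zero) // zero value takes the fast path above: no allocation
    }
}

func BenchmarkBoxNonZero(b *testing.B) {
    slc := make([]interface{}, 0, b.N)
    b.ReportAllocs()
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc = append(slc, ii+1<<20) // non-zero value goes through mallocgc: one allocation each
    }
}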
I know this does not directly help fix your code, but I hope it gives you the tools and techniques to analyze and optimize it.