Using interface{} in Go appears to cause up to a 40x slowdown. Is there a way around this when implementing an interface{}-based data structure?

I am currently trying to implement a tree-based data structure in Go and I am seeing disappointing results in my benchmarking. Because I am trying to be generic as to what values I accept, I am limited to using interface{}.

The code in question is an immutable vector trie. Essentially, any time a value in the vector is modified I need to make a copy of several nodes in the trie. Each of these nodes is implemented as a slice of constant (compile-time-known) length. For example, writing a value into a large trie will require copying 5 separate 32-element slices. They must be copied to preserve the immutability of the previous contents.
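For concreteness, here is a minimal sketch of that copy-on-write step (the 32-slot width matches the description above; the function names are illustrative, not from my actual code):

package trie

// copyNode duplicates one 32-slot trie node so that the previous
// version of the vector keeps seeing the original, unmodified slice.
func copyNode(src []interface{}) []interface{} {
    dst := make([]interface{}, len(src))
    copy(dst, src)
    return dst
}

// setInNode returns a copy of a node with one slot replaced; a write
// deep in the trie repeats this once per level (e.g., 5 times).
func setInNode(src []interface{}, i int, v interface{}) []interface{} {
    dst := copyNode(src)
    dst[i] = v
    return dst
}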

I believe the disappointing benchmark results come from storing my data as interface{} in slices, which are created, copied, and appended to often. To measure this, I set up the following benchmark:

package main

import (
    "math/rand"
    "testing"
)

func BenchmarkMake10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]int, 10e6, 10e6)
    }
}

func BenchmarkMakePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]*int, 10e6, 10e6)
    }
}

func BenchmarkMakeInterface10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        _ = make([]interface{}, 10e6, 10e6)
    }
}

func BenchmarkMakeInterfacePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        // same slice type as above; "Ptr" refers to what would be stored in it
        _ = make([]interface{}, 10e6, 10e6)
    }
}

func BenchmarkAppend10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]int, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, jj)
        }
    }
}

func BenchmarkAppendPtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]*int, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            // every element gets the address of the single loop variable
            // jj, which forces jj onto the heap
            slc = append(slc, &jj)
        }
    }
}

func BenchmarkAppendInterface10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]interface{}, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, jj)
        }
    }
}

func BenchmarkAppendInterfacePtr10M(b *testing.B) {
    for ii := 0; ii < b.N; ii++ {
        slc := make([]interface{}, 0, 0)
        for jj := 0; jj < 10e6; jj++ {
            slc = append(slc, &jj)
        }
    }
}

func BenchmarkSet(b *testing.B) {
    slc := make([]int, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc[rand.Intn(10e6-1)] = 1
    }
}

func BenchmarkSetPtr(b *testing.B) {
    slc := make([]*int, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        theInt := 1
        slc[rand.Intn(10e6-1)] = &theInt
    }
}

func BenchmarkSetInterface(b *testing.B) {
    slc := make([]interface{}, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        slc[rand.Intn(10e6-1)] = 1
    }
}

func BenchmarkSetInterfacePtr(b *testing.B) {
    slc := make([]interface{}, 10e6, 10e6)
    b.ResetTimer()
    for ii := 0; ii < b.N; ii++ {
        theInt := 1
        slc[rand.Intn(10e6-1)] = &theInt
    }
}

which gives the following results:

BenchmarkMake10M-4                       300       4962381 ns/op
BenchmarkMakePtr10M-4                    100      10255522 ns/op
BenchmarkMakeInterface10M-4              100      19788588 ns/op
BenchmarkMakeInterfacePtr10M-4           100      19850682 ns/op
BenchmarkAppend10M-4                      20      67090711 ns/op
BenchmarkAppendPtr10M-4                    1    2784300818 ns/op
BenchmarkAppendInterface10M-4              1    3457503833 ns/op
BenchmarkAppendInterfacePtr10M-4           1    3532502711 ns/op
BenchmarkSet-4                      30000000            43.5 ns/op
BenchmarkSetPtr-4                   20000000            91.2 ns/op
BenchmarkSetInterface-4             30000000            43.5 ns/op
BenchmarkSetInterfacePtr-4          20000000            70.9 ns/op

The difference on Set and Make seems to be about 2-4x, but the difference on Append is about 40x.

From what I understand, the performance hit comes from interfaces being implemented behind the scenes as pointers, so the values they box must be allocated on the heap. That still doesn't explain why Append is hit so much harder than Set or Make.

Is there a way, in the current Go language and without using a code generation tool (e.g., a generics tool that lets the consumer of the library generate a version of the library to store FooType), to work around this 40x performance hit? Alternatively, have I made some error in my benchmarking?

Let's profile the tests with memory allocation statistics (-benchmem) enabled.

go test -bench . -cpuprofile cpu.prof -benchmem
goos: linux
goarch: amd64
BenchmarkMake10M-8                           100          10254248 ns/op        80003282 B/op          1 allocs/op
BenchmarkMakePtr10M-8                        100          18696295 ns/op        80003134 B/op          1 allocs/op
BenchmarkMakeInterface10M-8                   50          34501361 ns/op        160006147 B/op         1 allocs/op
BenchmarkMakeInterfacePtr10M-8                50          35129085 ns/op        160006652 B/op         1 allocs/op
BenchmarkAppend10M-8                          20          69971722 ns/op        423503264 B/op        50 allocs/op
BenchmarkAppendPtr10M-8                        1        2135090501 ns/op        423531096 B/op        62 allocs/op
BenchmarkAppendInterface10M-8                  1        1833396620 ns/op        907567984 B/op  10000060 allocs/op
BenchmarkAppendInterfacePtr10M-8               1        2270970241 ns/op        827546240 B/op        53 allocs/op
BenchmarkSet-8                          30000000                54.0 ns/op             0 B/op          0 allocs/op
BenchmarkSetPtr-8                       20000000                91.6 ns/op             8 B/op          1 allocs/op
BenchmarkSetInterface-8                 30000000                58.0 ns/op             0 B/op          0 allocs/op
BenchmarkSetInterfacePtr-8              20000000                88.0 ns/op             8 B/op          1 allocs/op
PASS
ok      _/home/grzesiek/test    22.427s

We can see that the slowest benchmarks are the ones that make allocations.

PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof
Total: 29.75s
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppend10M
     210ms  1.51s (flat, cum)  5.08% of Total
     .      1.30s     4e827a: CALL runtime.growslice(SB)                 ;_/home/grzesiek/test.BenchmarkAppend10M test_test.go:35
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
     20ms   930ms (flat, cum)  3.13% of Total
     .      630ms     4e8519: CALL runtime.growslice(SB)                 ;_/home/grzesiek/test.BenchmarkAppendInterface10M test_test.go:53
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterfacePtr10M
     0      800ms (flat, cum)  2.69% of Total
     .      770ms     4e8625: CALL runtime.growslice(SB)                 ;_/home/grzesiek/test.BenchmarkAppendInterfacePtr10M test_test.go:62
ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendPtr10M
     0      950ms (flat, cum)  3.19% of Total
     .      870ms     4e8374: CALL runtime.growslice(SB)                 ;_/home/grzesiek/test.BenchmarkAppendPtr10M test_test.go:44

By looking at the number of bytes allocated, we can see that using interface{} doubles the allocation size.
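That doubling matches the in-memory layout: on 64-bit platforms an interface{} value is two words, a type pointer and a data pointer, while an int or a *int is one word. A quick standalone check:

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    var i int
    var p *int
    var e interface{}
    // On amd64 this prints 8, 8, 16: an interface value carries a type
    // word plus a data word, hence the doubled slice footprint.
    fmt.Println(unsafe.Sizeof(i), unsafe.Sizeof(p), unsafe.Sizeof(e))
}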

Why is BenchmarkAppend10M so much faster than the other BenchmarkAppend* benchmarks? To figure this out, we need to look at the escape analysis.

go test -gcflags '-m -l' original_test.go

./original_test.go:31:28: BenchmarkAppend10M b does not escape                      
./original_test.go:33:14: BenchmarkAppend10M make([]int, 0, 0) does not escape      
./original_test.go:40:31: BenchmarkAppendPtr10M b does not escape                   
./original_test.go:42:14: BenchmarkAppendPtr10M make([]*int, 0, 0) does not escape  
./original_test.go:43:7: moved to heap: jj                                          
./original_test.go:44:22: &jj escapes to heap                                       
./original_test.go:49:37: BenchmarkAppendInterface10M b does not escape             
./original_test.go:51:14: BenchmarkAppendInterface10M make([]interface {}, 0, 0) does not escape                                                                         
./original_test.go:53:16: jj escapes to heap                                        
./original_test.go:58:40: BenchmarkAppendInterfacePtr10M b does not escape          
./original_test.go:60:14: BenchmarkAppendInterfacePtr10M make([]interface {}, 0, 0) does not escape                                                                      
./original_test.go:61:7: moved to heap: jj                                          
./original_test.go:62:16: &jj escapes to heap                                       
./original_test.go:62:22: &jj escapes to heap                                       

We can see that it is the only benchmark in which jj does not escape to the heap. We can deduce that accessing a heap-allocated variable causes the slowdown.
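The same effect can be reproduced in isolation. In this small sketch (a hypothetical file, separate from the benchmarks above), building with -gcflags='-m -l' should report "moved to heap: jj" only for the pointer-taking version:

package main

// Build with: go build -gcflags='-m -l' escape.go

func appendVals() []int {
    slc := make([]int, 0, 4)
    for jj := 0; jj < 4; jj++ {
        slc = append(slc, jj) // jj is copied by value; it stays on the stack
    }
    return slc
}

func appendPtrs() []*int {
    slc := make([]*int, 0, 4)
    for jj := 0; jj < 4; jj++ {
        jj := jj               // fresh variable per iteration
        slc = append(slc, &jj) // &jj outlives the iteration: "moved to heap: jj"
    }
    return slc
}

func main() {
    _, _ = appendVals(), appendPtrs()
}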

Why does BenchmarkAppendInterface10M make so many allocations? In the assembly, we can see that it is the only one that calls the runtime.convT2E64 function.

PPROF_BINARY_PATH=. go tool pprof -disasm BenchmarkAppend cpu.prof

ROUTINE ======================== _/home/grzesiek/test.BenchmarkAppendInterface10M
      30ms      1.10s (flat, cum)  3.35% of Total
         .      260ms     4e8490: CALL runtime.convT2E64(SB)                 

The source code from runtime/iface.go looks like this:

func convT2E64(t *_type, elem unsafe.Pointer) (e eface) {
    if raceenabled {
        raceReadObjectPC(t, elem, getcallerpc(), funcPC(convT2E64))
    }
    if msanenabled {
        msanread(elem, t.size)
    }
    var x unsafe.Pointer
    if *(*uint64)(elem) == 0 {
        x = unsafe.Pointer(&zeroVal[0])
    } else {
        x = mallocgc(8, t, false)
        *(*uint64)(x) = *(*uint64)(elem)
    }
    e._type = t
    e.data = x
    return
}

As we can see, it performs the allocation by calling the mallocgc function. Every non-zero int boxed into an interface{} therefore gets its own 8-byte heap copy, which matches the ~10,000,060 allocations reported by -benchmem above; zero values are special-cased to share zeroVal.
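The boxing cost is easy to observe directly. This sketch uses testing.AllocsPerRun; the exact count depends on the Go version, since newer runtimes also cache boxed small integers:

package main

import (
    "fmt"
    "testing"
)

func main() {
    var sink interface{}
    // Boxing a non-zero int into an interface{} goes through
    // convT2E64/mallocgc, so we expect roughly one allocation per run.
    n := testing.AllocsPerRun(1000, func() {
        sink = 123456789 // large enough to miss any small-int cache
    })
    fmt.Println("allocs per interface{} store:", n)
    _ = sink
}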

I know this does not directly fix your code, but I hope it gives you the tools and techniques to analyze and optimize it.
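One practical consequence, already visible in the -benchmem numbers above (53 vs 10,000,060 allocations for AppendInterfacePtr vs AppendInterface): storing a pointer in an interface{} needs no mallocgc call, because the pointer fits directly in the interface's data word. So one partial mitigation for an interface{}-based container is to have callers store pointers to values they have already allocated. A sketch of that check:

package main

import (
    "fmt"
    "testing"
)

func main() {
    var sink interface{}
    p := new(int) // the value is heap-allocated once, by the caller
    n := testing.AllocsPerRun(1000, func() {
        sink = p // a pointer fits in the interface data word: no boxing
    })
    fmt.Println("allocs per pointer store:", n) // expect 0
    _ = sink
}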