The Mystery of Bitwise Operation Performance

While benchmarking the performance of converting byte slices to uint32s, I noticed that the conversion ran faster when starting with the least significant byte:

package blah

import (
    "bytes"
    "encoding/binary"
    "testing"
)

func BenchmarkByteConversion(t *testing.B) {
    var i uint32 = 3419234848
    buf := new(bytes.Buffer)
    _ = binary.Write(buf, binary.BigEndian, i)
    b := buf.Bytes()

    for n := 0; n < t.N; n++ {
        // Start with the least significant byte: 0.27 nanos
        value := uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24

        // Start with the most significant byte: 0.68 nanos
        // value := uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
        _ = value
    }
}

When I run go test -bench=., I get 0.27 ns per iteration when computing value the first way and 0.68 ns per iteration when computing it the second way. Why is it roughly twice as fast to start with the least significant byte when ORing the bytes together?

There is no mystery. Optimization!

package blah

import (
    "bytes"
    "encoding/binary"
    "testing"
)

func BenchmarkByteConversionLeast(t *testing.B) {
    var i uint32 = 3419234848
    buf := new(bytes.Buffer)
    _ = binary.Write(buf, binary.BigEndian, i)
    b := buf.Bytes()

    for n := 0; n < t.N; n++ {
        // Start with the least significant byte
        value := uint32(b[3]) | uint32(b[2])<<8 | uint32(b[1])<<16 | uint32(b[0])<<24
        _ = value
    }
}

func BenchmarkByteConversionMost(t *testing.B) {
    var i uint32 = 3419234848
    buf := new(bytes.Buffer)
    _ = binary.Write(buf, binary.BigEndian, i)
    b := buf.Bytes()

    for n := 0; n < t.N; n++ {
        // Start with the most significant byte
        value := uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
        _ = value
    }
}

Output:

go test silly_test.go -bench=.
goos: linux
goarch: amd64
BenchmarkByteConversionLeast-4      2000000000           0.72 ns/op
BenchmarkByteConversionMost-4       2000000000           1.80 ns/op

It should be obvious. Bounds check elimination.


Just use common sense. If the first access is b[3] and that bounds check passes, then indices 2, 1, and 0 are obviously also valid, so the compiler eliminates their checks. If the first access is b[0], passing that check says nothing about indices 1, 2, and 3, so each of them needs its own check. One bounds check versus four bounds checks.
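
If that is the whole story, handing the compiler a single check up front should make the MSB-first version just as fast. A minimal sketch, meant to be added to the same test file as above; the BenchmarkByteConversionMostHinted name is mine, and the expected speedup follows from the explanation rather than from a measurement in the original post:

func BenchmarkByteConversionMostHinted(t *testing.B) {
    var i uint32 = 3419234848
    buf := new(bytes.Buffer)
    _ = binary.Write(buf, binary.BigEndian, i)
    b := buf.Bytes()

    for n := 0; n < t.N; n++ {
        _ = b[3] // early bounds check: one check covers b[0] through b[3]
        // Start with the most significant byte, as before
        value := uint32(b[0])<<24 | uint32(b[1])<<16 | uint32(b[2])<<8 | uint32(b[3])
        _ = value
    }
}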

Wikipedia: Bounds checking

Wikipedia: Bounds-checking elimination
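
You can also ask the compiler which bounds checks survive optimization. The gc toolchain has an ssa/check_bce debug option that reports the source position of each check it could not eliminate; the exact flag spelling has varied a little across Go releases, so treat this invocation as a sketch:

go test -gcflags="-d=ssa/check_bce/debug=1" silly_test.go

With the LSB-first expression the loop body should report a single remaining check; with the MSB-first expression it should report four, matching the count above.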


You should also be reading good code, like the Go standard library. For example, from encoding/binary:

func (littleEndian) PutUint64(b []byte, v uint64) {
    _ = b[7] // early bounds check to guarantee safety of writes below
    b[0] = byte(v)
    b[1] = byte(v >> 8)
    b[2] = byte(v >> 16)
    b[3] = byte(v >> 24)
    b[4] = byte(v >> 32)
    b[5] = byte(v >> 40)
    b[6] = byte(v >> 48)
    b[7] = byte(v >> 56)
}
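
The read direction uses the same trick, and for the conversion in the question you do not need to hand-roll the shifts at all: binary.BigEndian.Uint32 does the same work and, at least in recent versions of the standard library, carries the same early bounds check hint internally. A sketch of a third benchmark for the same test file; the BenchmarkByteConversionStdlib name is mine:

func BenchmarkByteConversionStdlib(t *testing.B) {
    var i uint32 = 3419234848
    buf := new(bytes.Buffer)
    _ = binary.Write(buf, binary.BigEndian, i)
    b := buf.Bytes()

    for n := 0; n < t.N; n++ {
        // Same big-endian bytes-to-uint32 conversion, via the standard library.
        value := binary.BigEndian.Uint32(b)
        _ = value
    }
}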