I am attempting to benchmark the maximum STW GC pause time for different numbers of heap objects. To do this I have written a simple benchmark that pushes and pops messages from a map:
package main

type message []byte
type channel map[int]message

const (
    windowSize = 200000
    msgCount   = 1000000
)

func mkMessage(n int) message {
    m := make(message, 1024)
    for i := range m {
        m[i] = byte(n)
    }
    return m
}

func pushMsg(c *channel, highID int) {
    lowID := highID - windowSize
    m := mkMessage(highID)
    (*c)[highID] = m
    if lowID >= 0 {
        delete(*c, lowID)
    }
}

func main() {
    c := make(channel)
    for i := 0; i < msgCount; i++ {
        pushMsg(&c, i)
    }
}
I ran this with GODEBUG=gctrace=1, and on my machine the output is:
gc 1 @0.004s 2%: 0.007+0.44+0.032 ms clock, 0.029+0.22/0.20/0.28+0.12 ms cpu, 4->4->3 MB, 5 MB goal, 4 P
gc 2 @0.009s 3%: 0.007+0.64+0.042 ms clock, 0.030+0/0.53/0.18+0.17 ms cpu, 7->7->7 MB, 8 MB goal, 4 P
gc 3 @0.019s 1%: 0.007+0.99+0.037 ms clock, 0.031+0/0.13/1.0+0.14 ms cpu, 13->13->13 MB, 14 MB goal, 4 P
gc 4 @0.044s 2%: 0.009+2.3+0.032 ms clock, 0.039+0/2.3/0.30+0.13 ms cpu, 25->25->25 MB, 26 MB goal, 4 P
gc 5 @0.081s 1%: 0.009+9.2+0.082 ms clock, 0.039+0/0.32/9.7+0.32 ms cpu, 49->49->48 MB, 50 MB goal, 4 P
gc 6 @0.162s 0%: 0.020+10+0.078 ms clock, 0.082+0/0.28/11+0.31 ms cpu, 93->93->91 MB, 96 MB goal, 4 P
gc 7 @0.289s 0%: 0.020+27+0.092 ms clock, 0.080+0/0.95/28+0.37 ms cpu, 178->178->173 MB, 182 MB goal, 4 P
gc 8 @0.557s 1%: 0.023+38+0.086 ms clock, 0.092+0/38/10+0.34 ms cpu, 337->339->209 MB, 346 MB goal, 4 P
gc 9 @0.844s 1%: 0.008+40+0.077 ms clock, 0.032+0/5.6/46+0.30 ms cpu, 407->409->211 MB, 418 MB goal, 4 P
gc 10 @1.100s 1%: 0.009+43+0.047 ms clock, 0.036+0/6.6/50+0.19 ms cpu, 411->414->212 MB, 422 MB goal, 4 P
gc 11 @1.378s 1%: 0.008+45+0.093 ms clock, 0.033+0/6.5/52+0.37 ms cpu, 414->417->213 MB, 425 MB goal, 4 P
The format of this output is documented under GODEBUG=gctrace=1 in the runtime package documentation. In each line, the three clock times (e.g. 0.007+0.44+0.032 ms) are the STW sweep-termination pause, the concurrent mark/scan phase, and the STW mark-termination pause.
My version of Go is:
$ go version
go version go1.7.1 darwin/amd64
From the above results, the longest wall-clock STW pause time is 0.093 ms. Great!
However, as a sanity check I also manually timed how long it took to create a new message, by wrapping mkMessage with

start := time.Now()
m := mkMessage(highID)
elapsed := time.Since(start)

and printing the slowest elapsed time. The slowest time I got was 38.573036ms!
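For completeness, a self-contained version of that instrumentation might look like the following; the inlined push logic and the worst variable are my additions for illustration, and it assumes the types and constants from the benchmark above plus imports of fmt and time:

// Record the slowest mkMessage call over the whole run.
var worst time.Duration
c := make(channel)
for i := 0; i < msgCount; i++ {
    start := time.Now()
    m := mkMessage(i)
    if d := time.Since(start); d > worst {
        worst = d
    }
    c[i] = m // pushMsg inlined so the timer brackets only the allocation
    if lowID := i - windowSize; lowID >= 0 {
        delete(c, lowID)
    }
}
fmt.Printf("slowest mkMessage: %v\n", worst)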
I was instantly suspicious, because this 38 ms worst case correlated strongly with the wall-clock times in the supposedly concurrent mark/scan phase, and in particular with the "idle GC time" figures.
My question is: why does this supposedly concurrent phase of the GC appear to block the mutator?
If I force the GC to run at regular intervals (sketched below), my manually calculated pause times go way down, to under 1 ms, so the program appears to be hitting some kind of limit on the amount of non-live heap data. If so, I'm not sure what that limit is, or why it would cause a concurrent phase of the GC to appear to block the mutator.
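By "forcing the GC at regular intervals" I mean something like this variant of the main loop; the interval of 100000 iterations is arbitrary, chosen for illustration, and it needs the runtime import:

c := make(channel)
for i := 0; i < msgCount; i++ {
    pushMsg(&c, i)
    if i > 0 && i%100000 == 0 {
        runtime.GC() // blocks the caller until the collection completes
    }
}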
A concurrent GC pass will typically start at a time when a certain amount of space is available for new allocations. If that amount of space is adequate to handle all allocations that occur before the pass is complete, the amount of time the application has to wait on the GC will be minimal. If, however, the quantity of allocations exceeds the space available, new allocation requests will have to wait for the GC to free up more space.
Note that most concurrent GC systems have definite cycles even though programs can run while they are in progress. The bulk of a GC cycle is spent identifying all of the objects to which references exist, and the only way to identify that no references exist to any individual object is to identify all the references that exist to everything else. Consequently, everything that will get freed by a GC cycle will get freed at once, and nothing will get freed in the interim.
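One way to check what the runtime itself counts as pause time, independent of gctrace, is runtime.ReadMemStats: MemStats.PauseNs holds the durations of the most recent 256 STW pauses. A minimal sketch, assuming imports of fmt, runtime, and time:

// Print the longest STW pause the runtime has recorded recently.
var ms runtime.MemStats
runtime.ReadMemStats(&ms)
var max uint64
for _, p := range ms.PauseNs {
    if p > max {
        max = p
    }
}
fmt.Printf("max recent STW pause: %v\n", time.Duration(max))

Note that this, like gctrace, only counts STW pauses: time the mutator spends waiting on the collector before it can allocate does not appear here, which is consistent with the behavior observed in the question.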
To give this more visibility (and detail) than in my comment: the behavior seems to match what's described in Go bug #9477 and fixed by Go changelist 23540.
What happens is, when a thread allocates memory, the garbage collector might give it a little memory-scanning work to do; this is called a "mutator assist." But previously, the work might consist of scanning a map, which would cause a long pause with a big enough map. Consistent with what's reported in the question and comments, the pause would happen when some arbitrary allocation took place--not necessarily while using the map--and would not show up in GC pause stats since it wasn't part of the stop-the-world phase.
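If you want to see where the allocating goroutine stalls, rather than inferring it from manual timing, the runtime/trace package can capture an execution trace to feed to go tool trace. A minimal sketch; the output file name is arbitrary, and it assumes imports of log, os, and runtime/trace:

// Capture an execution trace of the workload.
f, err := os.Create("trace.out")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
if err := trace.Start(f); err != nil {
    log.Fatal(err)
}
defer trace.Stop()
// ... run the benchmark loop here ...

Viewing the result with go tool trace trace.out shows the goroutine timelines; a long stall on the allocating goroutine during the concurrent mark phase is the signature of this problem (newer Go versions label mark assists explicitly in the viewer).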
The fix made to Go, not yet released as I'm writing, was to limit how much memory a single mutator assist would scan so that it would take no more than about 100 microseconds on typical modern server hardware. If you're encountering this problem I'd recommend trying the latest from the github.com/golang/go repository (or 1.8 once it's out) where the fix is applied. (There's also another big pause-reduction change in there since 1.7.) If you still have problems and haven't been able to find any explanation, I'd go to golang-nuts.
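If upgrading isn't immediately an option, one mitigation to consider (my suggestion, not anything from the bug report or the fix) is to shard one huge map into several smaller ones, so that no single object handed to an assist for scanning is enormous. A rough sketch, where the shard count of 64 and all the names are arbitrary:

const shardCount = 64 // arbitrary; choose so each shard stays modest

// shardedChannel spreads the id space across many small maps so that
// no single map's bucket array becomes one huge object to scan.
type shardedChannel [shardCount]map[int]message

func newShardedChannel() *shardedChannel {
    var s shardedChannel
    for i := range s {
        s[i] = make(map[int]message)
    }
    return &s
}

func (s *shardedChannel) push(highID int) {
    s[highID%shardCount][highID] = mkMessage(highID)
    if lowID := highID - windowSize; lowID >= 0 {
        delete(s[lowID%shardCount], lowID)
    }
}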