I have an application written in Go that does message processing. It needs to pick up messages from the network (UDP) at a rate of 20K/second (potentially more), and each message can be up to the UDP packet's maximum length (64KB minus the header size). The program needs to decode each incoming packet, encode it into another format, and send it to another network.

Right now, on a 24-core / 64GB RAM machine it runs OK, but it occasionally loses some packets. The program already follows the pipeline pattern, using multiple goroutines and channels, and it takes about 10% of the whole machine's CPU, so it has the headroom to spend more CPU (or RAM) to handle all 20K/s messages without losing a single one. So I started profiling. In the CPU profile, runtime.mallocgc appears at the top; that is the garbage-collector runtime, so I suspect the GC is the culprit: it pauses for a few milliseconds (or some microseconds) and packets get lost. Some best practices say that switching to sync.Pool might help, but my switch to the pool seemingly results in more CPU contention and loses even more packets, even more often.
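The overall shape of the pipeline is roughly like this (a simplified sketch; the identifiers, worker count, port, and channel size are illustrative, not my exact code):

// simplified sketch of the pipeline; identifiers are illustrative only
package main

import (
	"log"
	"net"
)

const maxUDPPayload = 64 * 1024 // upper bound on a UDP payload

type rawMsg struct {
	buf []byte
}

func main() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9000})
	if err != nil {
		log.Fatal(err)
	}

	in := make(chan rawMsg, 4096) // buffered to absorb bursts

	// worker goroutines: decode, re-encode, forward to the other network
	for i := 0; i < 16; i++ {
		go func() {
			for m := range in {
				processAndForward(m.buf) // decode + encode + send (not shown)
			}
		}()
	}

	// reader goroutine: pull packets off the socket as fast as possible
	buf := make([]byte, maxUDPPayload)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Println("read:", err)
			continue
		}
		msg := rawMsg{buf: make([]byte, n)} // one allocation per message, so buf can be reused
		copy(msg.buf, buf[:n])
		in <- msg
	}
}

func processAndForward(b []byte) { /* decode, encode, send */ }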
(pprof) top20 -cum (sync|runtime)
245.99s of 458.81s total (53.61%)
Dropped 487 nodes (cum <= 22.94s)
Showing top 20 nodes out of 22 (cum >= 30.46s)
flat flat% sum% cum cum%
0 0% 0% 440.88s 96.09% runtime.goexit
1.91s 0.42% 1.75% 244.87s 53.37% sync.(*Pool).Get
64.42s 14.04% 15.79% 221.57s 48.29% sync.(*Pool).getSlow
94.29s 20.55% 36.56% 125.53s 27.36% sync.(*Mutex).Lock
1.62s 0.35% 36.91% 72.85s 15.88% runtime.systemstack
22.43s 4.89% 41.80% 60.81s 13.25% runtime.mallocgc
22.88s 4.99% 46.79% 51.75s 11.28% runtime.scanobject
1.78s 0.39% 47.17% 49.15s 10.71% runtime.newobject
26.72s 5.82% 53.00% 39.09s 8.52% sync.(*Mutex).Unlock
0.76s 0.17% 53.16% 33.74s 7.35% runtime.gcDrain
0 0% 53.16% 33.70s 7.35% runtime.gcBgMarkWorker
0 0% 53.16% 33.69s 7.34% runtime.gcBgMarkWorker.func2
The use of the pool is the standard pattern:

// create this one globally at program init
var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct) }}

// get
rf := rfpool.Get().(*aPrivateStruct)

// put it back after we are done processing this message
rfpool.Put(rf)
I am not sure whether I am doing something wrong, or whether there are other ways to tune the GC to use less CPU. The Go version is 1.8.

The listing shows a lot of lock contention happening in pool.getSlow (source: pool.go at golang.org):
(pprof) list sync.*.getSlow
Total: 7.65mins
ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
1.07mins 3.69mins (flat, cum) 48.29% of Total
. . 144: x = p.New()
. . 145: }
. . 146: return x
. . 147:}
. . 148:
80ms 80ms 149:func (p *Pool) getSlow() (x interface{}) {
. . 150: // See the comment in pin regarding ordering of the loads.
30ms 30ms 151: size := atomic.LoadUintptr(&p.localSize) // load-acquire
180ms 180ms 152: local := p.local // load-consume
. . 153: // Try to steal one element from other procs.
30ms 130ms 154: pid := runtime_procPin()
20ms 20ms 155: runtime_procUnpin()
730ms 730ms 156: for i := 0; i < int(size); i++ {
51.55s 51.55s 157: l := indexLocal(local, (pid+i+1)%int(size))
580ms 2.01mins 158: l.Lock()
10.65s 10.65s 159: last := len(l.shared) - 1
40ms 40ms 160: if last >= 0 {
. . 161: x = l.shared[last]
. . 162: l.shared = l.shared[:last]
. 10ms 163: l.Unlock()
. . 164: break
. . 165: }
490ms 37.59s 166: l.Unlock()
. . 167: }
40ms 40ms 168: return x
. . 169:}
. . 170:
. . 171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
. . 172:// Caller must call runtime_procUnpin() when done with the pool.
. . 173:func (p *Pool) pin() *poolLocal {
From https://golang.org/pkg/sync/#Pool :

A free list maintained as part of a short-lived object is not a suitable use for a Pool, since the overhead does not amortize well in that scenario. It is more efficient to have such objects implement their own free list.
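One common way to implement such an application-level free list (a sketch under my own assumptions, not code from the question) is a buffered channel of preallocated objects; the channel buffer is the free list, and it falls back to allocating only when empty:

// freeList is a sketch of a simple fixed-size free list built on a buffered
// channel; the names are illustrative, reusing aPrivateStruct from the question.
type freeList chan *aPrivateStruct

func newFreeList(size int) freeList {
	fl := make(freeList, size)
	for i := 0; i < size; i++ {
		fl <- new(aPrivateStruct) // preallocate once at startup
	}
	return fl
}

func (fl freeList) get() *aPrivateStruct {
	select {
	case v := <-fl:
		return v
	default:
		return new(aPrivateStruct) // list exhausted: allocate
	}
}

func (fl freeList) put(v *aPrivateStruct) {
	select {
	case fl <- v:
	default: // list full: drop the object and let the GC reclaim it
	}
}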
https://dave.cheney.net/2015/11/29/a-whirlwind-tour-of-gos-runtime-environment-variables
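Among the knobs described there, GOGC is the main one for trading memory for less frequent collection; the same setting can also be changed from code with runtime/debug.SetGCPercent. A minimal sketch (the value 400 is only an example, not a tuned number):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Default is 100: a collection is triggered when the heap has grown
	// by 100% over the live data from the previous cycle. A larger value
	// means fewer, later GCs at the cost of a bigger heap.
	old := debug.SetGCPercent(400) // equivalent to starting with GOGC=400
	fmt.Println("previous GC percent:", old)
}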
sync.Pool runs slowly under a highly concurrent load. Try to allocate all the structures once during startup and reuse them many times. For example, you may create several goroutines (workers) at start time, instead of starting a new goroutine on each request; a sketch of that pattern follows. I recommend reading this article: https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .
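A minimal sketch of that pattern, assuming a channel of raw packets feeding the workers (decodeInto and forward are hypothetical stand-ins for the question's decode/encode/send steps, and the worker count is not a tuned value):

// Each worker owns one preallocated aPrivateStruct and reuses it for every
// message, so the hot path does no allocation and touches no shared pool.
func startWorkers(n int, in <-chan []byte) {
	for i := 0; i < n; i++ {
		go func() {
			rf := new(aPrivateStruct) // allocated once, at startup
			for pkt := range in {
				decodeInto(rf, pkt) // reuse the same struct on every iteration
				forward(rf)
			}
		}()
	}
}

// Hypothetical stand-ins for the question's decode/encode/send steps.
func decodeInto(rf *aPrivateStruct, pkt []byte) { /* decode pkt into rf */ }
func forward(rf *aPrivateStruct)                { /* encode rf and send it on */ }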
Go 1.13 (Q4 2019) might change that: see CL 166961.
The original issue was issue 22950, "sync: avoid clearing the full Pool on every GC":

I find it surprising that there are around 1000 allocations again every cycle. This seems to indicate that the Pool is clearing its entire contents upon every GC. A peek at the implementation seems to indicate that this is so.
Result: "sync: smooth out Pool behavior over GC with a victim cache":

Currently, every Pool is cleared completely at the start of each GC. This is a problem for heavy users of Pool because it causes an allocation spike immediately after Pools are cleared, which impacts both throughput and latency.

This CL fixes this by introducing a victim cache mechanism. Instead of clearing Pools, the victim cache is dropped and the primary cache is moved to the victim cache. As a result, in steady state, there are (roughly) no new allocations, but if Pool usage drops, objects will still be collected within two GCs (as opposed to one).

This victim cache approach also improves Pool's impact on GC dynamics. The current approach causes all objects in Pools to be short-lived. However, if an application is in steady state and is just going to repopulate its Pools, then these objects impact the live heap size as if they were long-lived. Since Pooled objects count as short-lived when computing the GC trigger and goal, but act as long-lived objects in the live heap, this causes GC to trigger too frequently. If Pooled objects are a non-trivial portion of an application's heap, this increases the CPU overhead of GC. The victim cache lets Pooled objects affect the GC trigger and goal as long-lived objects.

This has no impact on Get/Put performance, but substantially reduces the impact on the Pool user when a GC happens. PoolExpensiveNew demonstrates this in the substantial reduction in the rate at which the "New" function is called.