Go GC analysis: runtime.mallocgc seems to rank first in the profile; then I switched to sync.Pool. Is there a combined solution?

I have an application written in Go that does message processing. It needs to pick up messages from the network (UDP) at a rate of 20K/second (potentially more), and each message can be up to the UDP packet's maximum length (64KB minus header size). The program needs to decode each incoming packet, re-encode it into another format, and send it to another network.

Right now, on a 24-core/64GB-RAM machine, it runs OK but occasionally loses some packets. The programming pattern already follows the pipeline style, using multiple goroutines and channels, and it takes only 10% of the whole machine's CPU load; so it should have the headroom in CPU and RAM to handle all 20K/s messages without losing a single one. I then started profiling. In the CPU profile, runtime.mallocgc appears at the top; that is the garbage collector at work, and I suspect this GC is the culprit: it pauses for a few milliseconds (or some microseconds) and packets get lost. Some best-practice guides say switching to sync.Pool might help, but my switch to the pool apparently results in more CPU contention, and I lose even more packets, even more often.

(pprof) top20 -cum (sync|runtime)
245.99s of 458.81s total (53.61%)
Dropped 487 nodes (cum <= 22.94s)
Showing top 20 nodes out of 22 (cum >= 30.46s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    440.88s 96.09%  runtime.goexit
     1.91s  0.42%  1.75%    244.87s 53.37%  sync.(*Pool).Get
    64.42s 14.04% 15.79%    221.57s 48.29%  sync.(*Pool).getSlow
    94.29s 20.55% 36.56%    125.53s 27.36%  sync.(*Mutex).Lock
     1.62s  0.35% 36.91%     72.85s 15.88%  runtime.systemstack
    22.43s  4.89% 41.80%     60.81s 13.25%  runtime.mallocgc
    22.88s  4.99% 46.79%     51.75s 11.28%  runtime.scanobject
     1.78s  0.39% 47.17%     49.15s 10.71%  runtime.newobject
    26.72s  5.82% 53.00%     39.09s  8.52%  sync.(*Mutex).Unlock
     0.76s  0.17% 53.16%     33.74s  7.35%  runtime.gcDrain
         0     0% 53.16%     33.70s  7.35%  runtime.gcBgMarkWorker
         0     0% 53.16%     33.69s  7.34%  runtime.gcBgMarkWorker.func2

The use of the pool follows the standard pattern:

// create this one globally at program init
var rfpool = &sync.Pool{New: func() interface{} { return new(aPrivateStruct) }}

// get
rf := rfpool.Get().(*aPrivateStruct)

// put after done processing this message
rfpool.Put(rf)

I am not sure what I am doing wrong. Or what other ways are there to tune the GC to use less CPU? The Go version is 1.8.

The listing shows that a lot of lock contention happened in pool.getSlow (source: pool.go at golang.org):

(pprof) list sync.*.getSlow
Total: 7.65mins
ROUTINE ======================== sync.(*Pool).getSlow in /opt/go1.8/src/sync/pool.go
  1.07mins   3.69mins (flat, cum) 48.29% of Total
         .          .    144:       x = p.New()
         .          .    145:   }
         .          .    146:   return x
         .          .    147:}
         .          .    148:
      80ms       80ms    149:func (p *Pool) getSlow() (x interface{}) {
         .          .    150:   // See the comment in pin regarding ordering of the loads.
      30ms       30ms    151:   size := atomic.LoadUintptr(&p.localSize) // load-acquire
     180ms      180ms    152:   local := p.local                         // load-consume
         .          .    153:   // Try to steal one element from other procs.
      30ms      130ms    154:   pid := runtime_procPin()
      20ms       20ms    155:   runtime_procUnpin()
     730ms      730ms    156:   for i := 0; i < int(size); i++ {
    51.55s     51.55s    157:       l := indexLocal(local, (pid+i+1)%int(size))
     580ms   2.01mins    158:       l.Lock()
    10.65s     10.65s    159:       last := len(l.shared) - 1
      40ms       40ms    160:       if last >= 0 {
         .          .    161:           x = l.shared[last]
         .          .    162:           l.shared = l.shared[:last]
         .       10ms    163:           l.Unlock()
         .          .    164:           break
         .          .    165:       }
     490ms     37.59s    166:       l.Unlock()
         .          .    167:   }
      40ms       40ms    168:   return x
         .          .    169:}
         .          .    170:
         .          .    171:// pin pins the current goroutine to P, disables preemption and returns poolLocal pool for the P.
         .          .    172:// Caller must call runtime_procUnpin() when done with the pool.
         .          .    173:func (p *Pool) pin() *poolLocal {

https://golang.org/pkg/sync/#Pool

A free list maintained as part of a short-lived object is not a suitable use for a Pool, since the overhead does not amortize well in that scenario. It is more efficient to have such objects implement their own free list.

  1. You may try to set the GOGC value to greater than 100.

https://dave.cheney.net/2015/11/29/a-whirlwind-tour-of-gos-runtime-environment-variables

  2. Or, implement your own free list.

http://golang-jp.org/doc/effective_go.html#leaky_buffer

sync.Pool runs slowly under a high-concurrency load. Try to allocate all the structures once during startup and reuse them many times. For example, you may create several goroutines (workers) at start time, instead of running a new goroutine for each request. I recommend reading this article: https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs .

Go 1.13 (Q4 2019) might change that: see CL 166961.

The original issue was issue 22950: "sync: avoid clearing the full Pool on every GC"

where I find it surprising that there is around 1000 allocations again every cycle. This seems to indicate that the Pool is clearing its entire contents upon every GC.
A peek at the implementation seems to indicate that this is so.

Result:

sync: smooth out Pool behavior over GC with a victim cache

Currently, every Pool is cleared completely at the start of each GC.
This is a problem for heavy users of Pool because it causes an allocation spike immediately after Pools are cleared, which impacts both throughput and latency.

This CL fixes this by introducing a victim cache mechanism.

Instead of clearing Pools, the victim cache is dropped and the primary cache is moved to the victim cache.

As a result, in steady-state, there are (roughly) no new allocations, but if Pool usage drops, objects will still be collected within two GCs (as opposed to one).

This victim cache approach also improves Pool's impact on GC dynamics.
The current approach causes all objects in Pools to be short lived. However, if an application is in steady state and is just going to repopulate its Pools, then these objects impact the live heap size as if they were long lived.
Since Pooled objects count as short lived when computing the GC trigger and goal, but act as long lived objects in the live heap, this causes GC to trigger too frequently.
If Pooled objects are a non-trivial portion of an application's heap, this increases the CPU overhead of GC. The victim cache lets Pooled objects affect the GC trigger and goal as long-lived objects.

This has no impact on Get/Put performance, but substantially reduces the impact on the Pool user when a GC happens.
PoolExpensiveNew demonstrates this in the substantial reduction in the rate at which the "New" function is called.