Goroutines (CGO): Unexplained OS threads spawned when using goroutines

I am using Go to parallelize 2D convolutions. The convolution is implemented in Go and compiled into a c-archive, which is linked into a C binary that calls the Go code. The Go code makes no calls to any C function.

Before any goroutines are spawned, the C code loads all matrices into memory, and the goroutines access them through shared memory.

I use GOMAXPROCS-1 to decide how many goroutines to spawn, and each goroutine is assigned an ID. Rows of the matrix are assigned to goroutines by ID in a striped fashion. Each goroutine is locked to an OS thread when spawned and releases the thread once it has finished.

For example, if GOMAXPROCS is set to 4, goroutine 0 takes rows 0, 4, 8, 12, etc., and goroutine 1 takes rows 1, 5, 9, 13, and so on.
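To make the setup concrete, here is a minimal sketch of the striping and thread-locking scheme described above. It is not my actual source: the names `stripedRows` and `processStriped` are made up for illustration, the per-row work is a placeholder, and the worker count of 4 is an assumption chosen to match the stride in the example.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// stripedRows returns the rows handled by worker id when rows are dealt
// out round-robin across `workers` goroutines (hypothetical helper).
func stripedRows(id, workers, rows int) []int {
	var out []int
	for r := id; r < rows; r += workers {
		out = append(out, r)
	}
	return out
}

// processStriped starts one goroutine per worker, locks each goroutine to
// an OS thread for its lifetime, and calls fn once for every assigned row.
func processStriped(workers, rows int, fn func(row int)) {
	var wg sync.WaitGroup
	for id := 0; id < workers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			runtime.LockOSThread()
			defer runtime.UnlockOSThread() // runs before wg.Done
			for _, r := range stripedRows(id, workers, rows) {
				fn(r)
			}
		}(id)
	}
	wg.Wait()
}

func main() {
	const workers, rows = 4, 16 // assumed values for illustration

	// Each element is written by exactly one worker, so no locking is needed.
	out := make([]int, rows)
	processStriped(workers, rows, func(r int) { out[r] = r * r })

	fmt.Println(stripedRows(0, workers, rows)) // [0 4 8 12]
	fmt.Println(stripedRows(1, workers, rows)) // [1 5 9 13]
	fmt.Println(out[15])                       // 225
}
```

Because every row index belongs to exactly one worker, the goroutines never write to the same element and need no synchronization beyond the final `wg.Wait()`.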

My issue is that when GOMAXPROCS is set to 4, Go spawns 11 OS threads.

Output of htop and atop: (screenshot not included)

My understanding is that these OS threads are spawned because the scheduler is trying to make sure that there are always threads available that are not blocked.

There is no I/O and there are no system calls after the goroutines have been spawned, so I don't understand why the scheduler is creating all these threads or what is blocking them.

The number of threads being spawned slows down execution when running with GOMAXPROCS >= 20 on a machine with 40 cores.

Why is the scheduler spawning all these threads? How can I debug where and how the goroutines are being blocked?

Source code