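Calling a function back-to-back makes each call faster, and I can't explain why

Problem description

I am writing a low-latency program and noticed something I cannot explain: the more frequently a function is called, the faster each call runs. To isolate the problem I wrote the following small test program: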

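#include <stdio.h>
#include <stdlib.h>     // atoi
#include <time.h>       // clock_gettime, CLOCK_MONOTONIC
#include <unistd.h>     // sysconf
#include <pthread.h>    // pthread_setaffinity_np, pthread_self
#include <sched.h>      // cpu_set_t, CPU_ZERO, CPU_SET
#include <chrono>
#include <thread>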

using namespace std;

long long get_curr_nsec()
{
    struct timespec now;
    ::clock_gettime(CLOCK_MONOTONIC, &now);
    return now.tv_sec * 1000000000LL + now.tv_nsec;
}

long long func(int n)
{
    long long t1 = get_curr_nsec();
    int sum = 0;
    for(int i = 0; i < n ;i++)
    {
        // compiler barrier: keeps the loop from being removed by the optimizer
        // (sum itself stays 0 and is never used, so it may still be optimized out)
        __asm__ __volatile__("": : :"memory");
        sum *= (sum+1);
    }

    return get_curr_nsec() - t1;
}

bool bind_cpu(int cpu_id, pthread_t tid)
{
    int cpu = (int)sysconf(_SC_NPROCESSORS_ONLN);
    cpu_set_t cpu_info;
    
    // valid CPU ids are 0 .. cpu-1
    if (cpu_id < 0 || cpu_id >= cpu)
    {
        printf("bind cpu failed: cpu_id[%d] out of range, cpu num[%d]\n", cpu_id, cpu);
        return false;
    }
    
    CPU_ZERO(&cpu_info);
    CPU_SET(cpu_id, &cpu_info);
    
    int ret = pthread_setaffinity_np(tid, sizeof(cpu_set_t), &cpu_info);
    if (ret)
    {
        printf("bind cpu failed, ret=%d\n", ret);
        return false;
    }
    
    return true;
}
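int main(int argc, char **argv)
{
    if (argc < 3)
    {
        printf("usage: %s <times> <interval_ms>\n", argv[0]);
        return 1;
    }

    // make sure the program does not switch CPUs
    bind_cpu(3, ::pthread_self());

    // argument 1: number of times to call the function
    // argument 2: interval between calls in milliseconds, implemented with sleep
    int times = ::atoi(argv[1]);
    int interval = ::atoi(argv[2]);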

    long long sum = 0;
    for(int i = 0; i < times; i++)
    {
        if(interval > 0)
        {
            std::this_thread::sleep_for(std::chrono::milliseconds(interval));
        }
        sum +=  func(100);
    }

    printf("avg elapse:%lld ns\n", sum/ times);
    return 0;
}

Server OS: CentOS Linux release 7.6.1810 (Core)
CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
Compiled with GCC using this command:

g++ --std=c++11 ./main.cpp  -O2 -lpthread

Results

My server uses core isolation, so this test program is the only thing running on CPU 3.

Command	Average time (ns)
./a.out 100 0	35
./a.out 100 1	36
./a.out 100 10	40
./a.out 100 100	45
./a.out 100 1000	50

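Additional observations:

Without sleeping, the time is very stable at 35 ns.
With sleeping, the time fluctuates slightly; with a 100 ms sleep, the peak can reach 100 ns.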

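My analysis

Factors I have suspected:

CPU cache. On my CPU each core has its own L1 and L2 caches, and the L3 cache is shared by all cores. The longer the sleep, the more likely other processes run in the meantime, and the more likely this program's data in L3 is evicted.
Branch prediction. I ran the program under perf stat, and the branch prediction numbers really do differ:
perf stat ./a.out 100 1: 241779 branches, 7091 branch-misses;
perf stat ./a.out 100 100: 241791 branches, 7636 branch-misses.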

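What I am looking for

I want to confirm which factor causes this. I have suspicions but cannot verify them.
Once the cause is known, I would like to see whether anything can be improved.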

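Note that std::chrono::milliseconds is milliseconds, not microseconds or nanoseconds.
A running program is essentially competing for CPU time slices, so the test program competes with other programs for time on the CPU (some programs cannot be kept off the core even with core isolation). When the test program does not sleep, it is more likely to hold the CPU; the longer it sleeps, the less likely. A familiar example: a busy loop that never stops will drive the CPU to full load, because the looping program is far more likely to win each time slice. So the shorter the sleep, the faster the function may run.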