Do the Q, K, and V projections in the attention mechanism actually need a bias?

I recently ran into the following question while studying the attention mechanism.

In one multi-head self-attention implementation, the projections are initialized like this:

        # Q/K/V projections and the output projection, all without a bias term
        self.W_Q = nn.Linear(self.d_model, self.d_q * self.n_heads, bias=False)
        self.W_K = nn.Linear(self.d_model, self.d_k * self.n_heads, bias=False)
        self.W_V = nn.Linear(self.d_model, self.d_v * self.n_heads, bias=False)
        self.fc = nn.Linear(self.n_heads * self.d_v, self.d_model, bias=False)

Here, bias=False.
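
For context, here is a minimal, runnable sketch of the kind of module such a snippet typically comes from. The class name, default sizes, and the attribute names d_q/d_k/d_v/n_heads are my own assumptions rather than the original code; it only shows the bias-free projections being used end to end:

    # Minimal sketch of multi-head self-attention with bias-free Q/K/V and
    # output projections. Names and default sizes are assumptions, not taken
    # from the original code.
    import torch
    import torch.nn as nn

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.d_model, self.n_heads = d_model, n_heads
            self.d_q = self.d_k = self.d_v = d_model // n_heads
            self.W_Q = nn.Linear(d_model, self.d_q * n_heads, bias=False)
            self.W_K = nn.Linear(d_model, self.d_k * n_heads, bias=False)
            self.W_V = nn.Linear(d_model, self.d_v * n_heads, bias=False)
            self.fc = nn.Linear(n_heads * self.d_v, d_model, bias=False)

        def forward(self, x):  # x: (batch, seq_len, d_model)
            b, t, _ = x.shape
            # Project, then split into heads: (batch, n_heads, seq_len, head_dim)
            q = self.W_Q(x).view(b, t, self.n_heads, self.d_q).transpose(1, 2)
            k = self.W_K(x).view(b, t, self.n_heads, self.d_k).transpose(1, 2)
            v = self.W_V(x).view(b, t, self.n_heads, self.d_v).transpose(1, 2)
            scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
            out = (scores.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
            return self.fc(out)

    x = torch.randn(2, 10, 512)
    print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 10, 512])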

But when I print the network structure of a pretrained BERT model, the Q/K/V projections look like this:

      (q): Linear(in_features=768, out_features=768, bias=True)
      (k): Linear(in_features=768, out_features=768, bias=True)
      (v): Linear(in_features=768, out_features=768, bias=True)
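
For reference, the bias setting can be read straight off a pretrained checkpoint. Below is a small sketch assuming the Hugging Face `transformers` package and the `bert-base-uncased` weights (both are my assumptions, since the post does not say which BERT implementation was printed):

    # Inspect whether BERT's Q/K/V projections carry a bias term.
    # Assumes Hugging Face transformers and the bert-base-uncased checkpoint.
    from transformers import BertModel

    model = BertModel.from_pretrained("bert-base-uncased")
    attn = model.encoder.layer[0].attention.self  # self-attention of layer 0

    for name in ("query", "key", "value"):
        linear = getattr(attn, name)
        print(name, linear, "| has bias:", linear.bias is not None)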

So which one is correct?