I recently ran into a question while studying the attention mechanism.
In the multi-head self-attention code I was studying, the projection layers are initialized like this:
self.W_Q = nn.Linear(self.d_model, self.d_q * self.n_heads, bias=False)
self.W_K = nn.Linear(self.d_model, self.d_k * self.n_heads, bias=False)
self.W_V = nn.Linear(self.d_model, self.d_v * self.n_heads, bias=False)
self.fc = nn.Linear(self.n_heads * self.d_v, self.d_model, bias=False)
Here, every projection is created with bias=False.
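For context, here is a minimal runnable sketch of that initialization; the class name, the default sizes, and the assumption d_q = d_k = d_v = d_model // n_heads are mine, not from the original code:

import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        # per-head dimension; assumed to be d_model // n_heads here
        self.d_q = self.d_k = self.d_v = d_model // n_heads
        # query/key/value projections and the output projection, all without an additive bias
        self.W_Q = nn.Linear(self.d_model, self.d_q * self.n_heads, bias=False)
        self.W_K = nn.Linear(self.d_model, self.d_k * self.n_heads, bias=False)
        self.W_V = nn.Linear(self.d_model, self.d_v * self.n_heads, bias=False)
        self.fc = nn.Linear(self.n_heads * self.d_v, self.d_model, bias=False)

print(MultiHeadSelfAttention().W_Q)  # Linear(in_features=512, out_features=512, bias=False)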
But when I print the network structure of a pretrained BERT model, I see:
(query): Linear(in_features=768, out_features=768, bias=True)
(key): Linear(in_features=768, out_features=768, bias=True)
(value): Linear(in_features=768, out_features=768, bias=True)
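For reference, that printout can be reproduced with something like the following; I am assuming the HuggingFace transformers BertModel here (the attribute path encoder.layer[0].attention.self is specific to that implementation):

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
# the Q/K/V projections of the first encoder layer
attn = model.encoder.layer[0].attention.self
print(attn.query)  # Linear(in_features=768, out_features=768, bias=True)
print(attn.key)    # Linear(in_features=768, out_features=768, bias=True)
print(attn.value)  # Linear(in_features=768, out_features=768, bias=True)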
So which one is correct?