How does multi-head attention split its input, and why is it split this way?
The shape contract of the split, in docstring form:

Parameters:
    x: Tensor
        A tensor with shape [batch_size, seq_length, depth].

Returns:
    A tensor with shape [batch_size, num_heads, seq_length, depth / num_heads].
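In short: the split happens along the depth (feature) dimension, never the sequence dimension, so every head still sees all positions but attends over its own lower-dimensional subspace of size depth / num_heads. Because the total dimensionality is unchanged, the overall compute cost is about the same as one full-depth head, while each head can learn a different attention pattern. Below is a minimal sketch of such a split, assuming a TensorFlow-style helper; the function name `split_heads` and the `num_heads` argument are assumptions for illustration, not taken from the snippet above:

```python
import tensorflow as tf

def split_heads(x, num_heads):
    """Split the depth dimension of x into num_heads heads.

    Args:
        x: A tensor with shape [batch_size, seq_length, depth];
            depth must be statically known and divisible by num_heads.
        num_heads: Number of attention heads.

    Returns:
        A tensor with shape [batch_size, num_heads, seq_length, depth // num_heads].
    """
    batch_size = tf.shape(x)[0]        # dynamic batch dimension
    depth = x.shape[-1]                # static feature dimension
    # Reshape: carve depth into (num_heads, depth // num_heads) slices.
    x = tf.reshape(x, (batch_size, -1, num_heads, depth // num_heads))
    # Transpose so each head is an independent [seq_length, depth // num_heads] slice.
    return tf.transpose(x, perm=[0, 2, 1, 3])
```

The reshape-then-transpose order matters: reshaping directly to [batch_size, num_heads, seq_length, depth // num_heads] would interleave positions across heads, whereas reshaping last-dimension-first keeps each head's features contiguous.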
An illustration would help.
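In place of a diagram, here is a quick shape trace with hypothetical sizes (batch_size=2, seq_length=10, depth=512, num_heads=8), using the sketch above:

```python
x = tf.random.normal((2, 10, 512))   # [batch_size, seq_length, depth]
out = split_heads(x, num_heads=8)
print(out.shape)                     # (2, 8, 10, 64)
```

Conceptually, the 512-wide depth axis is cut into 8 contiguous slices of width 64; each slice becomes one head's view of the full 10-step sequence.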