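Applying attention mechanisms over feature dimensions and time steps

I have run into some problems using attention mechanisms that have puzzled me for a long time.
1. While reading about the ordinary attention mechanism on CSDN, I came across two ways of applying attention over the time dimension. One was written by hand by a blogger (code below); the other, used in a project, imports the attention package directly. I am unsure whether the two are equivalent and whether the blogger's code is correct. Thanks for any replies.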

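from keras import backend as K
from keras.layers import Dense, Lambda, Multiply, Permute, RepeatVector, Reshape

# TIME_STEPS and SINGLE_ATTENTION_VECTOR are module-level constants defined elsewhere.
def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = Permute((2, 1))(inputs)  # (batch_size, input_dim, time_steps)
    a = Reshape((input_dim, TIME_STEPS))(a) # this line is not useful. It's just to know which dimension is what.
    a = Dense(TIME_STEPS, activation='softmax')(a)  # softmax over the last axis, i.e. the time steps
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)  # (batch_size, time_steps)
        a = RepeatVector(input_dim)(a)  # (batch_size, input_dim, time_steps)
    a_probs = Permute((2, 1), name='attention_vec')(a)  # (batch_size, time_steps, input_dim)
    # Element-wise product; Multiply replaces the Keras 1 call merge(..., mode='mul').
    output_attention_mul = Multiply(name='attention_mul')([inputs, a_probs])
    return output_attention_mul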

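2. Likewise, the project mentioned in question 1 uses the package as shown below. Is this attention applied over time or over the feature dimension, and how can I tell?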

import numpy as np
from keras.layers import Input, LSTM, Dense
from keras.models import Model
from attention import Attention  # assumed: the attention package mentioned in question 1

num_samples, time_steps, input_dim, output_dim = 100, 10, 1, 1
data_x = np.random.uniform(size=(num_samples, time_steps, input_dim))
data_y = np.random.uniform(size=(num_samples, output_dim))
model_input = Input(shape=(time_steps, input_dim))
x = LSTM(64, return_sequences=True)(model_input)
x = Attention(units=32)(x)
x = Dense(1)(x)
model = Model(model_input, x)
model.compile(loss='mae', optimizer='adam')

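3. The same blogger from question 1 also wrote a version that applies attention over the feature dimension (code below). I would like to know whether this works.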

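from keras import backend as K
from keras.layers import Dense, Lambda, Multiply, Permute, RepeatVector

# SINGLE_ATTENTION_VECTOR is a module-level constant defined elsewhere.
def attention_3d_block(inputs):
    # inputs.shape = (batch_size, time_steps, input_dim)
    input_dim = int(inputs.shape[2])
    a = inputs
    a = Dense(input_dim, activation='softmax')(a)  # softmax over the last axis, i.e. the features
    if SINGLE_ATTENTION_VECTOR:
        a = Lambda(lambda x: K.mean(x, axis=1), name='dim_reduction')(a)  # (batch_size, input_dim)
        a = RepeatVector(input_dim)(a)  # (batch_size, input_dim, input_dim)
    a_probs = Permute((1, 2), name='attention_vec')(a)  # identity permutation
    # Element-wise product; Multiply replaces the Keras 1 call merge(..., mode='mul').
    output_attention_mul = Multiply(name='attention_mul')([inputs, a_probs])
    return output_attention_mul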

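4. Suppose I have data with multiple features and multiple time steps, and I want to apply attention at both the feature level and the time-step level. One structure I have seen uses spatial attention in the encoder and temporal attention in the decoder. What should the concrete model structure look like? Is it Input -> spatial attention -> LSTM -> temporal attention -> output?
Thanks, everyone, for your replies.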

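Drawing on GPT and my own understanding:

This is a function that applies a 3D attention block to time-series data. Its input is a tensor of shape (batch_size, time_steps, input_dim), where batch_size is the batch size, time_steps is the number of time steps, and input_dim is the number of features per step. The function first uses Permute to swap the last two axes, turning each (time_steps, input_dim) matrix into (input_dim, time_steps). This is needed because a Dense layer acts only on the last axis, so the permutation makes it operate along the time axis. The Reshape that follows has no effect; it only documents which axis is which. Next comes a fully connected layer with TIME_STEPS units and softmax activation. Its output keeps the shape (input_dim, time_steps), and because softmax normalizes over the last axis, each feature gets a set of weights over the time steps that sums to 1. If SINGLE_ATTENTION_VECTOR is True, the Lambda layer averages these weights over the feature axis, giving a tensor of shape (batch_size, time_steps), and RepeatVector copies it input_dim times to restore the shape (batch_size, input_dim, time_steps), so that all features share one attention vector. A final Permute returns the weights to (batch_size, time_steps, input_dim). Last, the input and the attention weights are multiplied element-wise (the Hadamard product), which applies the attention.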
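The shape bookkeeping above can be checked with plain arrays. The following is a minimal NumPy sketch, not the blogger's Keras code: the function name `time_attention` and the random `kernel` and `bias` standing in for the Dense layer's weights are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def time_attention(inputs, kernel, bias, single_attention_vector=False):
    """Mimic the time-step attention block with plain arrays.

    inputs: (batch, time_steps, input_dim)
    kernel: (time_steps, time_steps), bias: (time_steps,)  # hypothetical Dense weights
    """
    a = np.transpose(inputs, (0, 2, 1))      # Permute((2, 1)) -> (batch, input_dim, time_steps)
    a = softmax(a @ kernel + bias, axis=-1)  # Dense + softmax over the time axis, shape unchanged
    if single_attention_vector:
        a = a.mean(axis=1)                   # average over features -> (batch, time_steps)
        a = np.repeat(a[:, None, :], inputs.shape[2], axis=1)  # RepeatVector(input_dim)
    a_probs = np.transpose(a, (0, 2, 1))     # back to (batch, time_steps, input_dim)
    return inputs * a_probs, a_probs

rng = np.random.default_rng(0)
batch, time_steps, input_dim = 4, 10, 3
x = rng.normal(size=(batch, time_steps, input_dim))
w = rng.normal(size=(time_steps, time_steps))
b = np.zeros(time_steps)

out, probs = time_attention(x, w, b)
print(out.shape, probs.shape)               # (4, 10, 3) (4, 10, 3)
print(np.allclose(probs.sum(axis=1), 1.0))  # True: weights sum to 1 over the time steps
```

Because the weights sum to 1 along axis 1 (time) for every feature, this block is attention over time steps, applied separately per feature unless the single shared vector is enabled.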


The following answer was written jointly by the GPT-3.5 large model and the blogger 波罗歌:

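On applying attention over feature dimensions and time steps
When processing sequence data, ordinary attention mechanisms are usually applied over the time dimension. A common one is Bahdanau attention, which uses two weight matrices and a scoring vector to compute how relevant each encoder time step is to the current decoder state. Another common one is self-attention, widely used in the Transformer model. Its weights are also computed between positions in the sequence, that is, along the time dimension; the feature dimension is only split across heads and is not weighted directly. This mechanism has proven very effective on tasks such as machine translation.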
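To make the self-attention case concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration under stated assumptions rather than the Transformer implementation: the projection matrices `w_q`, `w_k`, `w_v` are random placeholders, and there is no batching or masking.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (time_steps, d_model) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (time_steps, time_steps)
    weights = softmax(scores, axis=-1)       # each row is a distribution over time steps
    return weights @ v, weights

rng = np.random.default_rng(0)
time_steps, d_model, d_k = 5, 8, 4
x = rng.normal(size=(time_steps, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

context, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)  # (5, 5)
print(context.shape)  # (5, 4)
```

The weight matrix has one row and one column per time step, which is a quick way to see along which axis the attention is computed.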
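Two different ways of implementing attention
The first way uses a fully connected (Dense) layer to turn the input tensor into weights and multiplies them element-wise with the input, so the output keeps the shape (batch_size, time_steps, input_dim). The second way uses an Attention layer. Assuming it is the layer from the attention package on PyPI, which the example resembles, it scores every time step against the last hidden state, takes the weighted sum of the hidden states, and returns a single vector of size units. The two differ in implementation, but both compute attention over the time dimension. A practical way to tell is to check along which axis the softmax weights sum to 1.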
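The second style, which collapses the sequence into one vector, can be sketched as follows. This is a simplified NumPy illustration of Luong-style (multiplicative) attention with the last hidden state as the query. Whether the package's Attention layer works exactly this way is an assumption, and `last_state_attention` and `w_score` are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def last_state_attention(hidden_states, w_score):
    """Score every time step against the last hidden state, then pool over time.

    hidden_states: (batch, time_steps, hidden_size)
    w_score: (hidden_size, hidden_size)  # hypothetical scoring matrix
    """
    h_t = hidden_states[:, -1, :]                                     # (batch, hidden_size)
    scores = np.einsum('bth,hk,bk->bt', hidden_states, w_score, h_t)  # (batch, time_steps)
    weights = softmax(scores, axis=-1)                                # distribution over time steps
    context = np.einsum('bt,bth->bh', weights, hidden_states)         # weighted sum over time
    return context, weights

rng = np.random.default_rng(0)
batch, time_steps, hidden_size = 4, 10, 16
h = rng.normal(size=(batch, time_steps, hidden_size))
w = rng.normal(size=(hidden_size, hidden_size)) / np.sqrt(hidden_size)

context, weights = last_state_attention(h, w)
print(context.shape)  # (4, 16): the time axis is gone
print(weights.shape)  # (4, 10): one weight per time step
```

Unlike the element-wise version, the output here has no time axis, which is why a Dense(1) layer can follow it directly in a regression model.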
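Implementing attention over the feature dimension
The third way also uses a fully connected layer to turn the input tensor into weights and is essentially the same as the first, except that it does not permute the time axis first. The softmax therefore runs over the features at each time step, so it weights features rather than time steps, which makes it better suited to multi-feature data. Note that with SINGLE_ATTENTION_VECTOR enabled, RepeatVector(input_dim) produces the shape (batch_size, input_dim, input_dim), which matches the input only when the number of time steps equals input_dim.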
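A NumPy sketch of the feature-level variant is given below. It is not the blogger's code: `kernel` and `bias` are random stand-ins for the Dense weights, and in the shared-vector branch the weights are repeated along the time axis (time_steps times rather than input_dim times), which is my assumption about the intended behavior.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_attention(inputs, kernel, bias, single_attention_vector=False):
    """Attention over features.

    inputs: (batch, time_steps, input_dim)
    kernel: (input_dim, input_dim), bias: (input_dim,)  # hypothetical Dense weights
    """
    a = softmax(inputs @ kernel + bias, axis=-1)  # weights over features at each time step
    if single_attention_vector:
        a = a.mean(axis=1)                        # average over time -> (batch, input_dim)
        # Assumption: repeat along the time axis so the shape matches the input again.
        a = np.repeat(a[:, None, :], inputs.shape[1], axis=1)
    return inputs * a, a

rng = np.random.default_rng(0)
batch, time_steps, input_dim = 4, 10, 3
x = rng.normal(size=(batch, time_steps, input_dim))
w = rng.normal(size=(input_dim, input_dim))
b = np.zeros(input_dim)

out, probs = feature_attention(x, w, b, single_attention_vector=True)
print(out.shape)                             # (4, 10, 3)
print(np.allclose(probs.sum(axis=-1), 1.0))  # True: weights sum to 1 over the features
```

Here the weights sum to 1 along the last axis, so the block ranks features against each other instead of ranking time steps.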
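Model structure with encoder spatial attention and decoder temporal attention
For sequence data with multiple features and multiple time steps, attention can be applied over both the features and the time steps. A concrete model structure is as follows: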

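from keras.layers import Input, Dense, LSTM, Multiply, Flatten, Softmax, Dot
from keras.models import Model

# Sizes taken from the example in question 2.
time_steps, input_dim, output_dim = 10, 1, 1

def feature_attention_block(hidden_states):
    # hidden_states.shape = (batch_size, time_steps, hidden_size)
    hidden_size = int(hidden_states.shape[2])
    # Dense acts on the last axis, so the softmax weights the features at each time step.
    weights = Dense(hidden_size, activation='softmax')(hidden_states)
    return Multiply()([hidden_states, weights])

def temporal_attention_block(hidden_states):
    # One score per time step: (batch_size, time_steps).
    scores = Flatten()(Dense(1, use_bias=False)(hidden_states))
    # Normalize the scores across the time steps.
    weights = Softmax()(scores)
    # Weighted sum over time: context vector of shape (batch_size, hidden_size).
    return Dot(axes=[1, 1])([hidden_states, weights])

input_data = Input(shape=(time_steps, input_dim))
lstm_output = LSTM(64, return_sequences=True)(input_data)

# Encoder: spatial (feature) attention
attention_vector1 = feature_attention_block(lstm_output)

# Decoder: temporal attention
lstm_output2 = LSTM(32, return_sequences=True)(attention_vector1)
attention_vector2 = temporal_attention_block(lstm_output2)

# Linear output and MAE loss, matching the regression setup in question 2.
output = Dense(output_dim)(attention_vector2)
model = Model(inputs=input_data, outputs=output)
model.compile(loss='mae', optimizer='adam')

Here feature_attention_block() and temporal_attention_block() implement attention over the feature dimension and over the time steps, respectively. The input hidden_states is the output of an LSTM layer (note that return_sequences=True is set, so the output is a 3D tensor). The feature block uses a Dense layer with softmax to turn the input into weights over the features at each time step and multiplies them element-wise with the input. The temporal block uses a Dense layer to give each time step a score, applies softmax across the time steps to obtain the attention weight vector, and then uses Dot to take the weighted sum of the hidden states, which yields the context vector.

In the model, an LSTM layer first processes the input data, and the spatial attention layer reweights the features of its output. A second LSTM layer then processes the reweighted sequence, and the temporal attention layer produces a context vector that summarizes the information from all time steps. Finally, a Dense layer maps this vector to the output dimension to give the final output.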
If my answer solved your problem, please accept it!

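If you want to apply attention at both the feature level and the time-step level, you need to modify the attention layer so that it applies attention over both the time steps and the features. First pass the original input through two fully connected layers to compute the attention weight matrices for the time steps and for the features, respectively; then multiply these matrices element-wise with the original input to get the attention layer's output. For a concrete implementation, you can refer to this blogger's post: https://blog.csdn.net/qq_41864229/article/details/107215380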
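The two-branch idea described above might look like the NumPy sketch below. It is not taken from the linked post: the function name `dual_attention`, the random weight matrices `w_time` and `w_feat`, and the choice to combine the two branches by multiplying both weight matrices into the input are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(inputs, w_time, w_feat):
    """Apply time-step and feature attention to the same input.

    inputs: (batch, time_steps, input_dim)
    w_time: (time_steps, time_steps)  # hypothetical Dense weights, time branch
    w_feat: (input_dim, input_dim)    # hypothetical Dense weights, feature branch
    """
    # Time branch: move time to the last axis, softmax over it, move it back.
    a_time = softmax(np.transpose(inputs, (0, 2, 1)) @ w_time, axis=-1)
    a_time = np.transpose(a_time, (0, 2, 1))    # (batch, time_steps, input_dim)
    # Feature branch: softmax over the feature axis at each time step.
    a_feat = softmax(inputs @ w_feat, axis=-1)  # (batch, time_steps, input_dim)
    # Assumption: combine both weightings by element-wise product with the input.
    return inputs * a_time * a_feat, a_time, a_feat

rng = np.random.default_rng(0)
batch, time_steps, input_dim = 4, 10, 3
x = rng.normal(size=(batch, time_steps, input_dim))
w_time = rng.normal(size=(time_steps, time_steps))
w_feat = rng.normal(size=(input_dim, input_dim))

out, a_time, a_feat = dual_attention(x, w_time, w_feat)
print(out.shape)                              # (4, 10, 3)
print(np.allclose(a_time.sum(axis=1), 1.0))   # True: normalized over time steps
print(np.allclose(a_feat.sum(axis=-1), 1.0))  # True: normalized over features
```

Both weight tensors have the same shape as the input, so the output can be fed to an LSTM unchanged; other ways of combining the branches, such as adding the two weighted inputs, are equally plausible.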

You can also refer to this article: "注意力机制概念和框架" (Attention mechanisms: concepts and framework)