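Problem with the Transformer example in trax

After training a Transformer on 200,000 sentence pairs, the BLEU score stays at 0, and I can't tell what is wrong. Source code attached: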

```python
import os
os.environ["TF_KERAS"] = '1'

import numpy as np
import sacrebleu
import trax
from absl.testing import absltest
from absl.testing import parameterized
from trax import layers as tl, shapes
from trax.supervised import training

import encoder  # local module with BpeTokenize, shown below


class TransformerResultGet(parameterized.TestCase):
  def test_transformer_BLEU(self):

    # German side of the corpus: used as the reference (target) sentences.
    refs = []
    with sacrebleu.smart_open('./WIT_DATA/europarl-v7.de-en.de') as f_read:
      for line in f_read:
        # Strip the trailing newline; keep the last line even if it has none.
        if line.endswith('\n'):
          line = line[:-1]
        refs.append(line)

    # English side of the corpus: used as the source sentences.
    srcs = []
    with sacrebleu.smart_open('./WIT_DATA/europarl-v7.de-en.en') as fd_read:
      for line in fd_read:
        if line.endswith('\n'):
          line = line[:-1]
        srcs.append(line)

    assert len(refs) == len(srcs)
    # Each example is a (source, reference) pair of raw strings.
    examples = [np.array([srcs[i], refs[i]]) for i in range(len(srcs))]

    # Tokenize, shuffle, drop long sources, batch by length, add loss weights.
    data_pipeline = trax.data.Serial(
      encoder.BpeTokenize(),
      trax.data.Shuffle(),
      trax.data.FilterByLength(max_length=200, length_keys=[0]),
      trax.data.BucketByLength(boundaries=[32, 128, 256, 512],
                               batch_sizes=[10, 10, 10, 10],
                               length_keys=[0]),
      trax.data.AddLossWeights()
    )

    # First 90% of the pairs for training, the next 5% for evaluation.
    n_train = int(len(examples) * 0.9)
    n_eval_end = int(len(examples) * 0.95)
    train_batches_stream = data_pipeline(examples[:n_train])
    eval_batches_stream = data_pipeline(examples[n_train:n_eval_end])

    trans_model = trax.models.Transformer(
      input_vocab_size=50257,
      d_model=256, d_ff=2048,
      n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
      max_len=300)

    # Initialize for (source, target) inputs of shape (1, 300),
    # then load previously saved weights.
    shape11 = shapes.ShapeDtype((1, 300), dtype=np.int32)
    _, _ = trans_model.init((shape11, shape11))
    trans_model.init_from_file('./output_dir/model.pkl.gz',
                               weights_only=True)

    train_task = training.TrainTask(
      labeled_data=train_batches_stream,
      loss_layer=trax.layers.WeightedCategoryCrossEntropy(),
      optimizer=trax.optimizers.Adam(0.001)
    )

    eval_task = training.EvalTask(
      labeled_data=eval_batches_stream,
      metrics=[tl.WeightedCategoryCrossEntropy()],
      n_eval_batches=1
    )

    # Evaluate every 1000 steps and checkpoint every 2000 steps.
    output_dir = os.path.expanduser('./output_dir/')
    training_loop = training.Loop(trans_model,
                                  train_task,
                                  eval_tasks=[eval_task],
                                  eval_at=lambda step_n: step_n % 1000 == 0,
                                  checkpoint_at=lambda step_n: step_n % 2000 == 0,
                                  output_dir=output_dir
                                  )
    training_loop.run(200000)


if __name__ == '__main__':
  absltest.main()
```

The relevant part of the encoder module:

```python
import gin
import numpy as np
from trax.data import debug_data_pipeline

# get_encoder is assumed to be defined earlier in this module,
# as in the encoder.py that ships with GPT-2.

@debug_data_pipeline.debug_pipeline
def bpeToken(stream):
    """Yields (source_ids, target_ids) for each (source, target) text pair."""
    enc = get_encoder('124M', '/home/dcjack/GPT-2/gpt-2/models')
    # Marker strings obtained by decoding ids 200 and 199; they are
    # re-encoded together with the sentence text below.
    start_token = enc.decode([200])
    end_token = enc.decode([199])
    for example in stream:
        source_ids = np.array(enc.encode(start_token + example[0] + end_token))
        target_ids = np.array(enc.encode(start_token + example[1] + end_token))
        # A tuple avoids building a ragged NumPy array from two lengths.
        yield source_ids, target_ids

@gin.configurable(module='trax.data')
def BpeTokenize(  # pylint: disable=invalid-name
    n_reserved_ids=0):
  """Returns a function that maps text to integer arrays; see `bpeToken`."""
  return lambda g: bpeToken(g)  # pylint: disable=g-long-lambda
```

Predictions after running on 100,000 examples:

```
The debate is the report is a important point.
The Commission is the first of the European Union, the European Union is a important of the European Union.
The Commission is the first of the European Union, the European Union is a important of the European Union, which is a important of the European Union.
The Commission is the first of the European Union, the European Union is a important of the European Union.
The Commission is the first of the European Union, the European Union is a important of the European Union.
The Commission is the question of the European Union is a important.
The Commission is the first of the European Union, the European Union is a important of the European Union.
The Commission is the first of the European Union, the European Union is a important of the European Union, but we have been able to make the European Union.
The Commission is the first of the European Union, the European Union is a important of the European Union.
The Commission is the first of the European Union, the European Union is a important of the European Union.

BLEU = 0.00, 15.2/0.6/0.0/0.0 (BP=1.000, ratio=1.098)
```

The corresponding source text being translated:

A Republican strategy to counter the re-election of Obama
Republican leaders justified their policy by the need to combat electoral fraud.
However, the Brennan Centre considers this a myth, stating that electoral fraud is rarer in the United States than the number of people killed by lightning.
Indeed, Republican lawyers identified only 300 cases of electoral fraud in the United States in a decade.
One thing is certain: these new provisions will have a negative impact on voter turn-out.
In this sense, the measures will partially undermine the American democratic system.
Unlike in Canada, the American States are responsible for the organisation of federal elections in the United States.
It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult.
This phenomenon gained momentum following the November 2010 elections, which saw 675 new Republican representatives added in 26 States.
As a result, 180 bills restricting the exercise of the right to vote in 41 States were introduced in 2011 alone.

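Error analysis is needed here to find out why the BLEU score is not rising. First, run the model on a smaller subset of the data, for example only the first 1,000 examples. This shows whether the problem is related to the data; if BLEU still does not improve, look at other factors. Second, train with different hyperparameter settings (for example a larger batch size or a smaller learning rate) to see whether they change the BLEU score. Finally, use a tool such as TensorBoard to visualize training progress and model performance and locate where the model goes wrong.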
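As one concrete form of error analysis, it can help to check whether the model's outputs depend on the input at all. The sketch below is plain Python and independent of trax; `summarize_hypotheses` is a hypothetical helper, and the sentences are the ten predictions quoted in the question.

```python
from collections import Counter

def summarize_hypotheses(hyps):
    """Reports how varied a list of predicted sentences is."""
    counts = Counter(hyps)
    most_common, freq = counts.most_common(1)[0]
    return {
        "n_total": len(hyps),
        "n_distinct": len(counts),
        "top_share": freq / len(hyps),  # share of the most frequent output
        "top_sentence": most_common,
    }

base = ("The Commission is the first of the European Union, "
        "the European Union is a important of the European Union.")
hyps = [
    "The debate is the report is a important point.",
    base,
    base[:-1] + ", which is a important of the European Union.",
    base,
    base,
    "The Commission is the question of the European Union is a important.",
    base,
    base[:-1] + ", but we have been able to make the European Union.",
    base,
    base,
]
summary = summarize_hypotheses(hyps)
print(summary["n_distinct"], summary["top_share"])  # 5 0.6
```

Here six of the ten outputs are identical even though the ten source sentences differ. That pattern would suggest the decoder is largely ignoring the source, which points toward the data pipeline or the decoding code rather than toward training time alone.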

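In addition, a few details in the code deserve attention.

There are several possible reasons why the BLEU score does not improve with this code:

Insufficient training data: 200,000 examples may not be enough to train a Transformer model. If possible, try increasing the amount of training data.
Too few training steps: the code trains for 200,000 steps. Whether that is enough depends on factors such as the size of the training set and the complexity of the model; with too few steps, BLEU may fail to improve.
Unsuitable learning rate: the code uses the Adam optimizer with a learning rate of 0.001, which may not be optimal. Try adjusting it.
Unsuitable model structure: the Transformer configuration used may not fit this translation task well. Try adjusting the structure or using another type of model.
Unsuitable data preprocessing: the preprocessing used may not fit this translation task well. Try adjusting the preprocessing method.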
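On the learning-rate point, Transformer training commonly uses a linear warmup followed by inverse-square-root decay rather than a constant rate. The function below is a minimal plain-Python sketch of that shape; the name `warmup_rsqrt_lr` and the values 4000 and 0.001 are illustrative assumptions, not settings taken from the original code.

```python
import math

def warmup_rsqrt_lr(step, n_warmup_steps=4000, max_value=0.001):
    """Linear warmup to max_value, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < n_warmup_steps:
        return max_value * step / n_warmup_steps
    return max_value * math.sqrt(n_warmup_steps / step)

# Print the rate at a few steps to see the warmup and the decay.
for step in (1, 1000, 4000, 16000, 64000):
    print(step, warmup_rsqrt_lr(step))
```

trax ships schedules of this kind in its `lr_schedules` module, and `TrainTask` accepts one through its `lr_schedule` argument; check the installed version for the exact function names before relying on them.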

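To address these issues, try the following steps:

Increase the amount of training data: if possible, add more training data.
Increase the number of training steps: train for longer and watch how the BLEU score changes.
Adjust the learning rate: try different learning rates and watch how the BLEU score changes.
Adjust the model structure: change the model structure or use another type of model, and watch how the BLEU score changes.
Adjust the data preprocessing: change the preprocessing method and watch how the BLEU score changes.
In short, solving this problem means trying different approaches, watching how the BLEU score responds, and finding the best combination of model and parameters.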
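When watching BLEU, it helps to remember that the score is a geometric mean of the 1-gram to 4-gram precisions, so a single zero precision (as in the reported `15.2/0.6/0.0/0.0`) drives the unsmoothed score to 0 even when some words match. The sketch below computes clipped n-gram precisions for one sentence pair in plain Python. It is a simplified illustration, not sacrebleu's implementation, and the two sentences are made up.

```python
import math
from collections import Counter

def ngram_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

def unsmoothed_bleu(hyp_tokens, ref_tokens, max_n=4):
    """Geometric mean of the n-gram precisions times the brevity penalty."""
    precisions = [ngram_precision(hyp_tokens, ref_tokens, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0, precisions  # any zero precision makes the product zero
    c, r = len(hyp_tokens), len(ref_tokens)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n), precisions

hyp = "the commission is the first of the european union".split()
ref = "the union is responsible for the organisation of elections".split()
score, precisions = unsmoothed_bleu(hyp, ref)
print(score, precisions)
```

In this example five of the nine hypothesis words occur in the reference, yet no bigram matches, so the score is exactly 0. A BLEU of 0.00 therefore means "no longer phrases match", not necessarily "nothing matches".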

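Your code looks fine. The BLEU score may fail to rise for one of the following reasons:

1. Too little data: 200,000 examples sounds like a lot, but it may still be insufficient for training a Transformer. Try enlarging the training set, or fine-tune a pretrained model.

2. Too few training steps: your code trains for 200,000 steps, which looks sufficient, but depending on the size of your dataset and model, more steps may be needed.

3. Unsuitable hyperparameters: tuning hyperparameters may help raise BLEU. Try a smaller or larger model, and adjust the learning rate, batch size, and so on.

4. Data preprocessing errors: there may be a problem in your preprocessing; length truncation, for example, can affect training. Try training on longer or shorter sentences, or change the truncation length.

I suggest checking these causes one by one to find the best way to raise the BLEU score.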
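On the preprocessing point, note that the pipeline in the question filters on the source length only (`length_keys=[0]`), while the model is built with `max_len=300`. A plain-Python sketch of a filter that checks both sides of each pair follows; `filter_pairs_by_length` is a hypothetical helper, and the token ids are invented.

```python
def filter_pairs_by_length(pairs, max_length):
    """Yields (source_ids, target_ids) pairs where both sides fit max_length."""
    for source_ids, target_ids in pairs:
        if 0 < len(source_ids) <= max_length and 0 < len(target_ids) <= max_length:
            yield source_ids, target_ids

pairs = [
    ([1, 2, 3], [4, 5]),            # kept
    ([1, 2, 3], list(range(500))),  # dropped: target too long
    ([], [4, 5]),                   # dropped: empty source
]
kept = list(filter_pairs_by_length(pairs, max_length=200))
print(len(kept))  # 1
```

Inside the trax pipeline, passing `length_keys=[0, 1]` to `FilterByLength` should have a similar effect, but verify that against the trax version you have installed.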

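Most likely the problem is the dataset: whether its quality is high enough and whether it is rich enough. On quality: if the dataset contains noise, grammatical errors, or inconsistent data, the BLEU score can drop. Try a higher-quality dataset or apply data cleaning. On size and richness: if the dataset is too small, the model may overfit or fail to capture enough linguistic features. Try a larger dataset or use data augmentation to extend it.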
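A minimal cleaning pass for a parallel corpus might look like the following sketch. It drops pairs with an empty side, pairs whose length ratio is extreme, and exact duplicate pairs. The function name and the ratio threshold of 3.0 are illustrative assumptions rather than recommended values, and the sample sentences are invented.

```python
def clean_parallel(srcs, refs, max_ratio=3.0):
    """Returns (source, reference) pairs that pass simple quality checks."""
    seen = set()
    cleaned = []
    for src, ref in zip(srcs, refs):
        src, ref = src.strip(), ref.strip()
        if not src or not ref:
            continue  # one side is empty
        n_src, n_ref = len(src.split()), len(ref.split())
        if max(n_src, n_ref) / min(n_src, n_ref) > max_ratio:
            continue  # lengths too different to be a plausible translation
        if (src, ref) in seen:
            continue  # exact duplicate pair
        seen.add((src, ref))
        cleaned.append((src, ref))
    return cleaned

srcs = ["Resumption of the session", "", "Thank you .", "Thank you .", "Yes"]
refs = ["Wiederaufnahme der Sitzungsperiode", "Danke", "Vielen Dank .",
        "Vielen Dank .", "Das ist eine sehr lange Antwort auf eine kurze Frage"]
cleaned = clean_parallel(srcs, refs)
print(len(cleaned))  # 2
```

Filtering both files together, as here, keeps the source and reference lines aligned, which filtering each file separately would not guarantee.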

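The model's parameter settings may be unsuitable, for example a learning rate that is too low, or the architecture may be poorly chosen. Consider adjusting the parameters or the architecture to make the model stronger and so raise the BLEU score.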

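Several issues may be affecting the BLEU score:

1. The training and test sets are not of high enough quality: BLEU is affected by the input text, so poor-quality input lowers the score. It is therefore important to check the training and test text for grammatical, spelling, and other errors.

2. The preprocessing is not good enough: you are using BPE tokenization here, but another method may suit the dataset better. Try alternatives such as WordPiece or SentencePiece tokenization.

3. The hyperparameters are not good enough: a Transformer has many hyperparameters, such as vocabulary size, number of layers, and hidden size. Changing them can have a large effect on BLEU. Experiment with them to find good values.

4. The model is not trained enough: 200,000 steps may not be enough for the model to converge. Try increasing the number of steps, for example to 500,000 or more.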

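Also, note that your code uses the GPT-2 BPE tokenizer (hence `input_vocab_size=50257`) together with a standard trax Transformer model, not a GPT-2 model. You may therefore need to adjust some hyperparameters to suit this tokenizer and your dataset.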
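Because the tokenizer and the model are separate pieces, one quick check is that the ids produced by the tokenizer fit the model's `input_vocab_size` and that the padding id does not double as a real token. The sketch below uses NumPy only; the example ids are invented, and treating 0 as the padding id is an assumption about how the batches are padded.

```python
import numpy as np

def check_token_ids(batch, vocab_size, pad_id=0):
    """Summarizes the id range and how often pad_id occurs in a batch."""
    batch = np.asarray(batch)
    return {
        "min_id": int(batch.min()),
        "max_id": int(batch.max()),
        "fits_vocab": bool(batch.min() >= 0 and batch.max() < vocab_size),
        "pad_fraction": float((batch == pad_id).mean()),
    }

# Two invented sequences; the first is padded with zeros to length 6.
batch = np.array([[200, 464, 4513, 199, 0, 0],
                  [200, 1212, 318, 257, 1332, 199]])
report = check_token_ids(batch, vocab_size=50257)
print(report)
```

If id 0 is also a regular token in the tokenizer's vocabulary, padded positions and real tokens become indistinguishable to a loss mask keyed on that id, so it is worth confirming which ids the tokenizer reserves.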

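Check whether the training data has problems: whether the corpus contains enough words, whether it includes dirty data, and whether the data is being fed in correctly. Also check that the model itself is correct, that the trainer's parameters are set properly, and that the architecture is reasonable. Finally, confirm that no anomalies occurred during training and that GPU memory is sufficient and used effectively.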
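The check for training anomalies can be partly automated. The sketch below scans a list of logged loss values for NaN or infinite entries and for a plateau. Both helper names, the window of 3, and the loss values are hypothetical; how the losses are collected from the training run is left open.

```python
import math

def find_bad_losses(losses):
    """Returns the indices of losses that are NaN or infinite."""
    return [i for i, loss in enumerate(losses) if not math.isfinite(loss)]

def has_plateaued(losses, window=3, tol=1e-3):
    """True if the last `window` finite losses barely improve on the previous ones."""
    finite = [loss for loss in losses if math.isfinite(loss)]
    if len(finite) < 2 * window:
        return False  # not enough history to judge
    recent = sum(finite[-window:]) / window
    earlier = sum(finite[-2 * window:-window]) / window
    return earlier - recent < tol

losses = [6.9, 5.2, float("nan"), 4.8, float("inf")]
bad_steps = find_bad_losses(losses)
print(bad_steps)  # [2, 4]
print(has_plateaued([5.0, 5.0, 5.0, 5.0, 5.0, 5.0]))  # True
```

A non-empty `bad_steps` list usually calls for a look at the learning rate and the input batches around those steps before training any longer.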