AllenNLP multi-GPU training: gradient backpropagation error

Problem:
2023-02-01 12:41:57,919 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
  File "/data/yutian/anaconda3/envs/py37/bin/allennlp", line 8, in <module>
    sys.exit(run())
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/__main__.py", line 39, in run
    main(prog="allennlp")
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 120, in main
    args.func(args)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 120, in train_model_from_args
    file_friendly_logging=args.file_friendly_logging,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 186, in train_model_from_file
    return_model=return_model,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 341, in train_model
    nprocs=num_procs,
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 508, in _train_worker
    metrics = train_loop.run()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 581, in run
    return self.trainer.train()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 771, in train
    metrics, epoch = self._try_train()
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 793, in _try_train
    train_metrics = self._train_epoch(epoch)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 510, in _train_epoch
    batch_outputs = self.batch_outputs(batch, for_training=True)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 403, in batch_outputs
    output_dict = self._pytorch_model(**batch)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 221 222
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

This is a gradient backpropagation problem. Going by the error message, there are two ways to fix it: 1. track down the parameters that are not used in producing the loss; 2. pass find_unused_parameters=True to DistributedDataParallel.

My attempts

1 Tracking down the unused parameters
Before running allennlp train, I set TORCH_DISTRIBUTED_DEBUG=DETAIL, which made the error print the names of the parameters that did not receive gradients:

Parameters which did not receive grad for rank 0: word_embedder.token_embedder_bert.transformer_model.pooler.dense.weight, word_embedder.token_embedder_bert.transformer_model.pooler.dense.bias
Parameter indices which did not receive grad for rank 0: 221 222
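For completeness, the same information can be reproduced locally without DDP. The following is only a minimal sketch (it assumes you can construct the model and pull a single batch from the data loader yourself, which is not shown in this post):

from typing import Any, Dict, List

import torch


def find_params_without_grad(model: torch.nn.Module, batch: Dict[str, Any]) -> List[str]:
    """Run one forward/backward pass (no DDP) and return the names of
    trainable parameters that never received a gradient."""
    model.zero_grad(set_to_none=True)
    output = model(**batch)          # AllenNLP models return a dict containing "loss"
    output["loss"].backward()
    return [name for name, p in model.named_parameters()
            if p.requires_grad and p.grad is None]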

So these two parameters take no part in the backward pass. They are the weight and bias of the BERT pooler, yet I load this pretrained model entirely through AllenNLP's own config:

"bert": {
                    "type": "pretrained_transformer",
                    "model_name": "/data/yutian/.pytorch_pretrained_bert",
                    "last_layer_only": true,
                }

I am not sure where a change is supposed to be made here.
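A likely explanation: the pretrained_transformer token embedder only consumes the token-level hidden states, so the BERT pooler head that gets loaded alongside the encoder never contributes to the loss, and DDP complains about its two parameters. One workaround (a hedged sketch, not the solution used later in this post; it assumes you can edit the model's constructor, and the word_embedder attribute name is taken from the parameter names in the log above) is to freeze the pooler, since DDP only expects reductions for parameters with requires_grad=True:

import torch


def freeze_bert_pooler(module: torch.nn.Module) -> None:
    """Disable gradients for the unused BERT pooler so DDP ignores it."""
    for name, param in module.named_parameters():
        if ".pooler." in name:
            param.requires_grad_(False)


# e.g. inside the model's __init__, after the embedder is built:
# freeze_bert_pooler(self.word_embedder)

Depending on the AllenNLP version, the trainer config may also accept a no_grad list of parameter-name regexes (e.g. [".*pooler.*"]) that does the same thing declaratively; check the GradientDescentTrainer signature of the installed version before relying on it.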
2 Setting find_unused_parameters=True
However, I train entirely through the allennlp train command; the only thing "distributed" about my setup is the following addition to the config:

"distributed": {
        "cuda_devices": [0, 1, 2, 3]
    },

I never wrap the model in DataParallel/DistributedDataParallel myself, so I had no idea how to pass arguments through to torch.nn.parallel.DistributedDataParallel (I am clearly out of my depth here).
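For context, this is what the flag means in plain PyTorch. AllenNLP builds this DistributedDataParallel wrapper internally (in its DDP accelerator, see below), so the sketch is only an illustration of the knob we are trying to reach, not something to paste into the config:

import torch
from torch.nn.parallel import DistributedDataParallel


def wrap_for_ddp(model: torch.nn.Module, device_id: int) -> DistributedDataParallel:
    """Assumes torch.distributed.init_process_group() has already been called."""
    return DistributedDataParallel(
        model.to(device_id),
        device_ids=[device_id],
        find_unused_parameters=True,  # tolerate parameters that receive no gradient
    )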

If anyone has run into something similar, any pointers would be much appreciated!

The full config is attached below:

{
    "random_seed": 42,
    "numpy_seed": 42,
    "pytorch_seed": 42,
    "dataset_reader": {
        "type": "rewrite",
        "lazy": false,
        "super_mode": "before",
        "joint_encoding": true,
        "use_bert": true,
        "language": "zh",
        "extra_stop_words": ["的", "是", "我", "了", "去"]
    },
    "model": {
        "type": "rewrite",
        "word_embedder": {
            "token_embedders": {
                "bert": {
                    "type": "pretrained_transformer",
                    "model_name": "/data/yutian/.pytorch_pretrained_bert",
                    "last_layer_only": true,
                    // "requires_grad": true
                }
            },
            // "allow_unmatched_keys": true,
            // "embedder_to_indexer_map": {
            //     "bert": [
            //        "bert",
            //        "bert-offsets",
            //        "bert-type-ids"
            //    ]
            // }
        },
        "text_encoder": {
            "type": "pytorch_transformer",
            "input_dim": 1152,
            "num_layers": 2,
            "positional_encoding": "sinusoidal"
        },
        "inp_drop_rate": 0.2,
        "out_drop_rate": 0.2,
        "feature_sel": 83,
        "loss_weights": [0.2, 0.2, 0.6],
        "super_mode": "before",
        "unet_down_channel": 64
    },
    "data_loader": {
        "batch_size": 12,
        "shuffle": true,
        "cuda_device": 1
    },
    "trainer": {
        "run_confidence_checks": 0,
        "num_epochs": 100,
        "patience": 10,
        "validation_metric": "+F3",
        // "cuda_device": 0,
        "optimizer": {
            "type": "adam",
            "parameter_groups": [
                [
                    [
                        ".*word_embedder.*", "text_encoder"
                    ],
                    {
                        "lr": 1e-5
                    }
                ]
            ],
            "lr": 1e-3
        },
        "learning_rate_scheduler": {
            "type": "reduce_on_plateau",
            "factor": 0.5,
            "mode": "max",
            "patience": 5
        },
        "num_serialized_models_to_keep": 10,
        "should_log_learning_rate": true
    },
    "distributed": {
        "cuda_devices": [0, 1, 2, 3]
    },
}

Solution: locate the allennlp.nn.parallel.ddp_accelerator source and change the default value of find_unused_parameters to True.

@DdpAccelerator.register("torch")
class TorchDdpAccelerator(DdpAccelerator):
    def __init__(
        self,
        *,
        find_unused_parameters: bool = False,  # change this default to True
        local_rank: Optional[int] = None,
        world_size: Optional[int] = None,
        cuda_device: Union[torch.device, int] = -1,
    ) -> None:
        ...  # body unchanged

I still do not fully understand the call path of allennlp train; in principle there should be an interface that sets this parameter without editing library code, but for now the problem is solved.
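One way to avoid patching the installed library might be to register a DdpAccelerator subclass with the default flipped and load it via allennlp train ... --include-package my_package. This is a hedged sketch only: the package name my_package and the registered name torch_find_unused are made up here, and whether (and where) your AllenNLP version's config accepts a ddp_accelerator entry to select such an accelerator needs to be checked against allennlp.commands.train for that version.

# my_package/ddp.py
from typing import Optional, Union

import torch
from allennlp.nn.parallel.ddp_accelerator import DdpAccelerator, TorchDdpAccelerator


@DdpAccelerator.register("torch_find_unused")
class TorchDdpFindUnusedAccelerator(TorchDdpAccelerator):
    """Same as the built-in "torch" accelerator, but with
    find_unused_parameters switched on."""

    def __init__(
        self,
        *,
        local_rank: Optional[int] = None,
        world_size: Optional[int] = None,
        cuda_device: Union[torch.device, int] = -1,
    ) -> None:
        super().__init__(
            find_unused_parameters=True,
            local_rank=local_rank,
            world_size=world_size,
            cuda_device=cuda_device,
        )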
