2023-02-01 12:41:57,919 - CRITICAL - root - Uncaught exception
Traceback (most recent call last):
File "/data/yutian/anaconda3/envs/py37/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/__main__.py", line 39, in run
main(prog="allennlp")
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 120, in main
args.func(args)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 120, in train_model_from_args
file_friendly_logging=args.file_friendly_logging,
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 186, in train_model_from_file
return_model=return_model,
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 341, in train_model
nprocs=num_procs,
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 508, in _train_worker
metrics = train_loop.run()
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/commands/train.py", line 581, in run
return self.trainer.train()
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 771, in train
metrics, epoch = self._try_train()
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 793, in _try_train
train_metrics = self._train_epoch(epoch)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 510, in _train_epoch
batch_outputs = self.batch_outputs(batch, for_training=True)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/allennlp/training/gradient_descent_trainer.py", line 403, in batch_outputs
output_dict = self._pytorch_model(**batch)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/data/yutian/anaconda3/envs/py37/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 221 222
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
This is a problem with gradient synchronization during backpropagation. Based on the error message, there are two ways to deal with it: 1. track down the unused parameters; 2. set find_unused_parameters=True.
1 Tracking down the unused parameters
Before running allennlp train, set TORCH_DISTRIBUTED_DEBUG=DETAIL; this prints the names of the parameters that did not receive gradients:
Parameters which did not receive grad for rank 0: word_embedder.token_embedder_bert.transformer_model.pooler.dense.weight, word_embedder.token_embedder_bert.transformer_model.pooler.dense.bias
Parameter indices which did not receive grad for rank 0: 221 222
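For reference, the debug flag only needs to be set for that one run; the invocation looks roughly like the following (the config path and serialization directory are placeholders for my actual ones):
TORCH_DISTRIBUTED_DEBUG=DETAIL allennlp train my_config.jsonnet -s output/rewrite_run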
These two parameters (the BERT pooler weights) take no part in backpropagation. However, I load this pretrained model entirely through AllenNLP's config:
"bert": {
  "type": "pretrained_transformer",
  "model_name": "/data/yutian/.pytorch_pretrained_bert",
  "last_layer_only": true,
}
so it is not obvious where the change should be made.
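One alternative I considered but did not pursue (a rough sketch, not something the pretrained_transformer config exposes as far as I know): the DDP reducer only tracks parameters that require gradients, so freezing the unused pooler inside the model's own __init__, before the trainer wraps it in DistributedDataParallel, should also make the error go away. The attribute name self.word_embedder below is an assumption about my model code:
# Hypothetical sketch: freeze the BERT pooler that never contributes to the
# loss, so DDP stops expecting gradients for it.
for name, param in self.word_embedder.named_parameters():
    if "transformer_model.pooler" in name:
        param.requires_grad = False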
2 Setting find_unused_parameters=True
However, I only ever use the allennlp train command; distributed training comes purely from adding the following config:
"distributed": {
  "cuda_devices": [0, 1, 2, 3]
},
I never wrap the model in DataParallel or DistributedDataParallel by hand, so I have no idea how to pass arguments to torch.nn.parallel.DistributedDataParallel. (I am clearly out of my depth here.)
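For context, this is what the error message is pointing at when the wrapper is built by hand in plain PyTorch; AllenNLP does the equivalent internally, so this is only an illustration (model and local_rank are placeholders):
from torch.nn.parallel import DistributedDataParallel

# Plain-PyTorch illustration: when the model is wrapped manually, the flag is
# just a keyword argument of the DDP constructor.
ddp_model = DistributedDataParallel(
    model.cuda(local_rank),
    device_ids=[local_rank],
    find_unused_parameters=True,
)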
The full config is attached for reference:
{
  "random_seed": 42,
  "numpy_seed": 42,
  "pytorch_seed": 42,
  "dataset_reader": {
    "type": "rewrite",
    "lazy": false,
    "super_mode": "before",
    "joint_encoding": true,
    "use_bert": true,
    "language": "zh",
    "extra_stop_words": ["的", "是", "我", "了", "去"]
  },
  "model": {
    "type": "rewrite",
    "word_embedder": {
      "token_embedders": {
        "bert": {
          "type": "pretrained_transformer",
          "model_name": "/data/yutian/.pytorch_pretrained_bert",
          "last_layer_only": true,
          // "requires_grad": true
        }
      },
      // "allow_unmatched_keys": true,
      // "embedder_to_indexer_map": {
      //   "bert": [
      //     "bert",
      //     "bert-offsets",
      //     "bert-type-ids"
      //   ]
      // }
    },
    "text_encoder": {
      "type": "pytorch_transformer",
      "input_dim": 1152,
      "num_layers": 2,
      "positional_encoding": "sinusoidal"
    },
    "inp_drop_rate": 0.2,
    "out_drop_rate": 0.2,
    "feature_sel": 83,
    "loss_weights": [0.2, 0.2, 0.6],
    "super_mode": "before",
    "unet_down_channel": 64
  },
  "data_loader": {
    "batch_size": 12,
    "shuffle": true,
    "cuda_device": 1
  },
  "trainer": {
    "run_confidence_checks": 0,
    "num_epochs": 100,
    "patience": 10,
    "validation_metric": "+F3",
    // "cuda_device": 0,
    "optimizer": {
      "type": "adam",
      "parameter_groups": [
        [
          [
            ".*word_embedder.*", "text_encoder"
          ],
          {
            "lr": 1e-5
          }
        ]
      ],
      "lr": 1e-3
    },
    "learning_rate_scheduler": {
      "type": "reduce_on_plateau",
      "factor": 0.5,
      "mode": "max",
      "patience": 5
    },
    "num_serialized_models_to_keep": 10,
    "should_log_learning_rate": true
  },
  "distributed": {
    "cuda_devices": [0, 1, 2, 3]
  },
}
The fix: locate the allennlp.nn.parallel.ddp_accelerator source and change the default value of find_unused_parameters to True.
@DdpAccelerator.register("torch")
class TorchDdpAccelerator(DdpAccelerator):
    def __init__(
        self,
        *,
        find_unused_parameters: bool = False,  # change this to True
        local_rank: Optional[int] = None,
        world_size: Optional[int] = None,
        cuda_device: Union[torch.device, int] = -1
    ) -> None:
I do not yet fully understand how allennlp train wires everything together; in theory there should be an interface for changing this parameter without editing library code. For now, though, the problem is solved.
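Follow-up sketch: if I read the newer AllenNLP 2.x trainer correctly, the DdpAccelerator is a registrable component that the trainer can accept directly, so the same effect should be achievable from the config alone. I have not verified the exact key on my installed version, so treat the snippet below as an assumption rather than a confirmed interface:
"trainer": {
  // Assumed interface: pass the flag through the registered "torch"
  // DdpAccelerator instead of editing ddp_accelerator.py.
  "ddp_accelerator": {
    "type": "torch",
    "find_unused_parameters": true
  }
}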