RecursionError: Caught RecursionError in replica 0

PyTorch 1.8.1, two identical working GPUs.

I woke up this morning to find that training had stopped. I was training a CNN for 100 epochs, and at epoch 97, for some unknown reason, a recursion error was raised:

RecursionError: Caught RecursionError in replica 0 on device 0.

and it finally fails with RecursionError: maximum recursion depth exceeded while calling a Python object. The statement closest to the reported error is in PyTorch's data-parallel code (data_parallel), at output.reraise():

# torch/nn/parallel/parallel_apply.py: after the worker threads finish,
# the main thread walks the per-replica results and re-raises any
# exception a worker captured in an ExceptionWrapper
for i in range(len(inputs)):
    output = results[i]
    if isinstance(output, ExceptionWrapper):
        output.reraise()

The part most related to parallel computation is my use of nn.DataParallel. My question is: why did the first 96 epochs run without any problem? My guess is that a colleague was also using the GPUs and they eventually ran out of memory, but I'm too embarrassed to ask. Any help would be appreciated...
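For context on why the error surfaces at output.reraise(): DataParallel runs each replica in a worker thread, catches any exception there, wraps it, and re-raises it on the main thread. Below is a minimal pure-Python sketch of that pattern (the class and function names mirror torch's ExceptionWrapper and parallel_apply, but this is an illustrative reimplementation, not torch's actual code):

```python
import threading

class ExceptionWrapper:
    """Stores an exception raised in a worker thread so the main
    thread can re-raise it later (mirrors torch._utils.ExceptionWrapper)."""
    def __init__(self, exc):
        self.exc_type = type(exc)
        self.msg = str(exc)

    def reraise(self):
        raise self.exc_type(self.msg)

def parallel_apply(fns, inputs):
    """Run each fns[i](inputs[i]) in its own thread, like one DataParallel replica."""
    results = [None] * len(inputs)

    def _worker(i, fn, x):
        try:
            results[i] = fn(x)
        except Exception as e:          # worker catches and wraps, does not crash
            results[i] = ExceptionWrapper(e)

    threads = [threading.Thread(target=_worker, args=(i, f, x))
               for i, (f, x) in enumerate(zip(fns, inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    outputs = []
    for r in results:                   # main thread re-raises the first failure
        if isinstance(r, ExceptionWrapper):
            r.reraise()
        outputs.append(r)
    return outputs
```

So the RecursionError you see on the main thread actually originated inside the forward pass of replica 0; reraise() is just the messenger.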

Full error log attached: the following traceback repeats countless times:

Original Traceback (most recent call last):
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RecursionError: Caught RecursionError in replica 0 on device 0.

Finally it throws:

RecursionError: maximum recursion depth exceeded while calling a Python object

Answer: Python's default recursion depth limit is 1000 (i.e., at most 1000 nested calls), and your program may have exceeded that maximum recursion depth at runtime. You can refer to this article; hope it helps: https://blog.csdn.net/qq_41320433/article/details/104299296
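For reference, the limit can be inspected and raised with sys.getrecursionlimit / sys.setrecursionlimit. A minimal sketch (note that raising the limit only masks the symptom; the real fix is finding what in the forward pass recurses unexpectedly):

```python
import sys

# Python refuses to recurse deeper than this many frames (default: 1000)
print(sys.getrecursionlimit())

def depth(n=0):
    """Recurse until the interpreter's limit is hit; return the depth reached."""
    try:
        return depth(n + 1)
    except RecursionError:
        return n

before = depth()
sys.setrecursionlimit(3000)   # workaround: allow deeper recursion
print(depth() > before)
```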