Environment: PyTorch 1.8.1, two identical GPUs.
I woke up one morning to find training had stopped. I was training a CNN model for 100 epochs, and at epoch 97, for reasons unknown, a recursion error was triggered:
RecursionError: Caught RecursionError in replica 0 on device 0.
It finally fails with RecursionError: maximum recursion depth exceeded while calling a Python object. The statement closest to the reported error is in torch's data-parallel training code (data_parallel), at output.reraise():
for i in range(len(inputs)):
    output = results[i]
    # A worker thread that hit an exception stores an ExceptionWrapper
    # in place of its result; the main thread re-raises it here.
    if isinstance(output, ExceptionWrapper):
        output.reraise()
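For context, parallel_apply runs one worker thread per model replica; a worker that raises stores an ExceptionWrapper in the results instead of an output, and the main thread re-raises it afterwards, which is why the message reads "Caught RecursionError in replica 0 on device 0". A minimal generic sketch of that catch-wrap-reraise pattern (plain Python, not torch's actual ExceptionWrapper):

import threading
import traceback

results = {}

def worker(i):
    try:
        results[i] = 1 / 0  # stand-in for module(*input, **kwargs); raises here
    except Exception:
        # Store the error (with its formatted traceback) instead of a result.
        results[i] = RuntimeError("Caught error in worker %d:\n%s" % (i, traceback.format_exc()))

t = threading.Thread(target=worker, args=(0,))
t.start()
t.join()
if isinstance(results[0], Exception):
    raise results[0]  # the main thread re-raises the worker's error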
The only parallelism-related piece here should be my use of nn.DataParallel. My question is why the first 96 epochs ran without any problem. My guess is that a colleague was also using the GPUs and memory eventually ran out, but I'm too embarrassed to ask them. Any help would be appreciated...
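One way to test the memory guess: a minimal sketch (log_gpu_memory is a hypothetical helper, not from the original code) that logs per-device memory at the end of each epoch; a sudden jump around epoch 96-97 would support it:

import torch

def log_gpu_memory(epoch):
    # Hypothetical helper: call once per epoch inside the training loop.
    for dev in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(dev) / 1024 ** 2     # MB currently held by tensors
        peak = torch.cuda.max_memory_allocated(dev) / 1024 ** 2  # MB peak since process start
        print(f"epoch {epoch} | cuda:{dev}: allocated {alloc:.0f} MB, peak {peak:.0f} MB")

That said, a genuine CUDA OOM usually surfaces as a "CUDA out of memory" RuntimeError rather than a RecursionError, so the logs would mainly rule the guess in or out.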
Full traceback attached. The following block repeats over and over:
Original Traceback (most recent call last):
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/hjn/miniconda3/envs/py36/lib/python3.6/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RecursionError: Caught RecursionError in replica 0 on device 0.
It finally throws:
RecursionError: maximum recursion depth exceeded while calling a Python object
Reply: Python's default recursion depth is 1000 (i.e., at most 1000 nested calls), and the program may have exceeded that maximum depth while running. You can refer to this article, hope it helps: https://blog.csdn.net/qq_41320433/article/details/104299296
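If the limit is indeed the issue, a minimal sketch of inspecting and raising it (10000 is an arbitrary example value):

import sys

print(sys.getrecursionlimit())  # default is 1000
sys.setrecursionlimit(10000)    # arbitrary example value; raises the cap

Note that raising the limit only buys headroom: if something in the forward pass recurses without bound (e.g., a module whose forward() ends up calling itself), it will still crash eventually.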