loss.backward() raises an error in torch: frame #0, CUDA error

Traceback (most recent call last):
File "SH_main.py", line 229, in <module>
loss.backward()
File "/home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA error: an illegal memory access was encountered (copy_kernel_cuda at /opt/conda/conda-bld/pytorch_1587428266983/work/aten/src/ATen/native/cuda/Copy.cu:180)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fa10d398b5e in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x240024f (0x7fa10f9c524f in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x9146ac (0x7fa138a9e6ac in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #3: <unknown function> + 0x911d73 (0x7fa138a9bd73 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #4: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x44 (0x7fa138a9d834 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x2ecd25d (0x7fa13b05725d in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0xb73b43 (0x7fa138cfdb43 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: at::native::to(at::Tensor const&, c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) + 0x6a0 (0x7fa138cfe690 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xe92d2a (0x7fa13901cd2a in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x291074e (0x7fa13aa9a74e in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xdd4282 (0x7fa138f5e282 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: at::Tensor::to(c10::TensorOptions const&, bool, bool, c10::optional<c10::MemoryFormat>) const + 0x15c (0x7fa13e3807bc in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #12: torch::autograd::CopyBackwards::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x47e (0x7fa13ac7c4ce in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: <unknown function> + 0x2ae8215 (0x7fa13ac72215 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #14: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fa13ac6f513 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #15: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7fa13ac702f2 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #16: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fa13ac68969 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #17: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fa13dfaf558 in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0xc819d (0x7fa140a0119d in /home/ydd/anaconda3/envs/py3/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #19: <unknown function> + 0x7e25 (0x7fa160761e25 in /lib64/libpthread.so.0)
frame #20: clone + 0x6d (0x7fa16048f34d in /lib64/libc.so.6)

What could be causing this? The model and all the external parameters it needs are already on the GPU.

The GPU also has enough capacity.

It looks like you modified the data of some tensor, which prevents the autograd engine from backpropagating.
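
To illustrate what "modifying a tensor's data" can mean for autograd, here is a minimal CPU-only sketch. The tensors `x`, `y`, `x2`, `z` are made up for the example and are not taken from the poster's code. On CPU this kind of in-place edit produces a clear RuntimeError rather than an illegal memory access, so it shows the general mechanism only and may not be the actual cause of the traceback above.

```python
import torch

# Hypothetical tensors for illustration only (not from the original script).
x = torch.ones(3, requires_grad=True)
y = x.exp()      # exp() saves its output y for use in the backward pass
y.add_(1.0)      # in-place edit: y no longer matches what autograd saved

try:
    y.sum().backward()
    inplace_error = None
except RuntimeError as err:
    # autograd detects that a saved tensor was changed in place
    inplace_error = str(err)

# Out-of-place version: the saved tensor is untouched, so backward succeeds.
x2 = torch.ones(3, requires_grad=True)
z = x2.exp() + 1.0
z.sum().backward()
```

The out-of-place form (`x2.exp() + 1.0`) allocates a new tensor and leaves the saved one intact, which is why it is usually the safer choice inside a training step.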

Thank you for your answer. I only applied label smoothing to the labels; I did not change any other tensor.
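
Since the label smoothing code is not shown, the following is only a sketch of one way to smooth labels without touching the original label tensor, with a range check added. The names `smooth_labels`, `soft_cross_entropy`, `num_classes` and `eps` are assumptions for illustration. The range check is there because label indices outside `[0, num_classes)` are a common source of illegal memory access errors on the GPU, though nothing in the thread confirms that this is what happened here.

```python
import torch
import torch.nn.functional as F

def smooth_labels(labels, num_classes, eps=0.1):
    """Return a new (N, num_classes) soft-target tensor; `labels` is not modified."""
    # Check the index range up front so a bad label fails with a readable error.
    if labels.min().item() < 0 or labels.max().item() >= num_classes:
        raise ValueError("label index out of range")
    one_hot = F.one_hot(labels, num_classes).float()
    # Out-of-place arithmetic: builds a new tensor instead of editing one in place.
    return one_hot * (1.0 - eps) + eps / num_classes

def soft_cross_entropy(logits, soft_targets):
    """Cross entropy against soft targets, averaged over the batch."""
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

# Hypothetical toy batch: 3 samples, 3 classes.
labels = torch.tensor([0, 2, 1])
logits = torch.randn(3, 3, requires_grad=True)
targets = smooth_labels(labels, num_classes=3)
loss = soft_cross_entropy(logits, targets)
loss.backward()
```

Running the check on a CPU copy of the labels for one epoch is a cheap way to rule out bad indices before looking elsewhere.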

Also, my model runs for two epochs, but then suddenly fails in the third epoch with the following error:

My code:

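The error points at the part where losses is appended to, but other code written the same way runs without problems, and sometimes this code runs to completion as well. I don't know why.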
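
One possible explanation for the error showing up at the `losses` line only some of the time: CUDA kernels are launched asynchronously, so an error is often reported at a later call than the one that caused it. A minimal debugging sketch follows, assuming the script can be edited at the very top. The helper `record_loss` is hypothetical, since the original code is not shown.

```python
import os

# Make CUDA kernel launches synchronous so the traceback points at the
# operation that actually failed. This must be set before the first CUDA
# call; the safest place is before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

losses = []

def record_loss(loss_value):
    """Store a plain Python float instead of a tensor.

    Hypothetical helper: in a training loop it would be called as
    record_loss(loss.item()), so the list does not keep the autograd
    graph of every step alive.
    """
    losses.append(float(loss_value))
```

With this setting training runs slower, so it is meant for locating the failing operation rather than for normal runs.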