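Having looked at your research area, I would like to ask for your help with a bug in my program.
Environment: Ubuntu 22.04, NVIDIA 1080 Ti, nvidia-tensorflow 1.15 + Python 3.8
The error below occurs when running a deep reinforcement learning algorithm. I have tried various possible causes and fixes, including reinstalling CUDA, cuDNN, and the GPU driver and switching TensorFlow versions, but the same error is still reported: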
2023-06-08 15:01:20.373919: W tensorflow/core/distributed_runtime/rpc/grpc_worker_service.cc:510] RecvTensor cancelled for 125519436669088645
Process Process-6:
Traceback (most recent call last):
File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1349, in _run_fn
return self._call_tf_sessionrun(options, feed_dict, fetch_list,
File "/home/mxm/.local/lib/python3.8/site-packages/tensorflow_core/python/client/session.py", line 1441, in _call_tf_sessionrun
return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
tensorflow.python.framework.errors_impl.AbortedError: From /job:train/replica:0/task:0:
The same RecvTensor (GrpcWorker) request was received twice. step_id: 125519436669088645 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;081dae2cbbb59408;/job:train/replica:0/task:0/device:GPU:0;edge_52_pred_0/c1/bias/read;0:0" request_id: -8313387993211503819
Additional GRPC error information:
{"created":"@1686207680.373235000","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"The same RecvTensor (GrpcWorker) request was received twice. step_id: 125519436669088645 rendezvous_key: "/job:ps/replica:0/task:0/device:GPU:0;081dae2cbbb59408;/job:train/replica:0/task:0/device:GPU:0;edge_52_pred_0/c1/bias/read;0:0" request_id: -8313387993211503819","grpc_status":10}
[[{{node pred_0/c1/bias/read}}]]
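The problem you are seeing appears to be related to TensorFlow's distributed training: the AbortedError is raised by the gRPC worker service, which reports that the same RecvTensor request was received twice. Here are some possible fixes: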
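Make sure every worker and parameter server is running normally. If any one of the machines has a problem, you may see this kind of error. The cause can be a network issue, such as firewall settings or network latency, or a hardware issue.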
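As a first check on the point above, here is a minimal sketch that tests whether each task's address accepts TCP connections. The addresses in CLUSTER are hypothetical placeholders, and only the job names "ps" and "train" are taken from the log. Replace them with whatever your script passes to tf.train.ClusterSpec.

```python
import socket

# Hypothetical cluster layout (an assumption, not taken from the original post).
# Only the job names "ps" and "train" appear in the error log.
CLUSTER = {
    "ps": ["localhost:2222"],
    "train": ["localhost:2223", "localhost:2224"],
}


def is_reachable(address, timeout=2.0):
    """Return True if a TCP connection to 'host:port' succeeds."""
    host, port = address.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_cluster(cluster, timeout=2.0):
    """Map each 'job/task_index' to whether its address accepts connections."""
    return {
        f"{job}/{index}": is_reachable(address, timeout)
        for job, addresses in cluster.items()
        for index, address in enumerate(addresses)
    }


if __name__ == "__main__":
    for task, ok in check_cluster(CLUSTER).items():
        print(task, "reachable" if ok else "NOT reachable")
```

A successful connection only shows that something is listening on the port. It does not prove the TensorFlow server behind it is healthy, so treat this as a way to rule out firewall or addressing mistakes rather than a full diagnosis.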
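Try reducing the batch size or the complexity of the model to lower the load on the GPU. Similar problems can sometimes appear when the GPU is processing too much data or running a very complex model.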
Avoid deprecated TensorFlow APIs, and make sure the APIs you call match the TensorFlow version you have installed.
If the problem persists, consider updating TensorFlow to the latest version, or switching to another deep learning framework such as PyTorch.
This is a very specific error and may be tied to your particular setup, so expect to spend some time debugging it.
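One starting point for that debugging is to split the rendezvous_key from the error message into its parts, which shows which tensor was being transferred and between which devices. The sketch below assumes the key has five semicolon-separated fields (source device, source incarnation, destination device, edge name, frame and iteration), which matches the key in the log above. The names parse_rendezvous_key and RendezvousKey are illustrative helpers, not part of TensorFlow.

```python
from typing import NamedTuple


class RendezvousKey(NamedTuple):
    """Fields of a rendezvous key, under the assumed five-field layout."""
    src_device: str
    src_incarnation: str
    dst_device: str
    edge_name: str
    frame_iter: str


def parse_rendezvous_key(key: str) -> RendezvousKey:
    """Split a rendezvous key on ';' and label the five fields."""
    parts = key.split(";")
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {key!r}")
    return RendezvousKey(*parts)


# Key copied from the error message in the question.
KEY = (
    "/job:ps/replica:0/task:0/device:GPU:0;081dae2cbbb59408;"
    "/job:train/replica:0/task:0/device:GPU:0;"
    "edge_52_pred_0/c1/bias/read;0:0"
)

if __name__ == "__main__":
    for field, value in parse_rendezvous_key(KEY)._asdict().items():
        print(f"{field}: {value}")
```

For the key in the log, this shows the variable read pred_0/c1/bias/read travelling from the GPU on the ps task to the GPU on train task 0, so the variable placement and the session setup of that train process are reasonable places to look first.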