Single-machine multi-GPU training: RuntimeError: Expected all tensors to be on the same device, but found at least two devices

When I try to fine-tune with DeepSpeed on a single machine with multiple GPUs, I get the following error:

Traceback (most recent call last):
  File "main.py", line 440, in <module>
    main()
  File "main.py", line 397, in main
    perplexity = evaluation(model, eval_dataloader)
  File "main.py", line 323, in evaluation
    outputs = model(**batch)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1695, in forward
    loss = self.module(*inputs, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 913, in forward
    transformer_outputs = self.transformer(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/transformers/models/bloom/modeling_bloom.py", line 730, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 2213, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

The other GPU reports the same error, except with

cuda:1 and cpu!

I have been digging into this for two days. I know the cause is that some tensors are on the CPU while others are on the GPU, but which tensor is on the CPU, and how do I move it over?

In torch/nn/functional.py, where the error is ultimately raised, I tried to print whether input and weight are on the CPU or the GPU:

print(input.untyped_storage())
print(weight.untyped_storage())

For input this prints [torch.storage.UntypedStorage(device=cpu) of size 4096], while printing weight raises CUDA error: an illegal memory access was encountered.

This makes it look like input is on the CPU, but when I try to move it to the GPU with input.cuda(), I again get CUDA error: an illegal memory access was encountered.
So is it on the CPU or on the GPU?

Could anyone tell me how to track down which step is causing this?

This error means that some of your tensors are not on the same device: specifically, the input data (input_ids and attention_mask) is on the CPU while the model parameters are on the GPU. This usually happens when the batches produced by the dataloader are never moved to the model's device.
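
To see the failure in isolation, here is a minimal sketch that reproduces the same RuntimeError: an embedding layer whose weights live on cuda:0 receiving an index tensor that is still on the CPU (the layer sizes are arbitrary):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4).cuda()  # weights on cuda:0
idx = torch.tensor([1, 2, 3])     # indices created on the CPU by default
# emb(idx) raises: Expected all tensors to be on the same device,
# but found at least two devices, cuda:0 and cpu!
out = emb(idx.cuda())             # fix: move the indices to the same device first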

To resolve this, you can try the following:

  1. Move the data onto the model's device before each forward pass, for example:

input_ids = input_ids.to(device)
attention_mask = attention_mask.to(device)

Here device is the target device, which you can create with torch.device(); a complete evaluation-loop example follows this list.

  2. Check that your data loading and distributed-training setup are correct. If your code uses distributed training, you must pass the correct distributed and device arguments when initializing the trainer (e.g. DeepSpeed or PyTorch Lightning); a minimal initialization sketch also follows this list.

  3. When a CUDA error occurs, add debug statements to pin down where it first happens: print intermediate results around the failing call and find the earliest one that is wrong. In your example code, you can print each input's device right before the forward pass (see also the debugging sketch after this list):

for key in batch:
    print(f"{key} device: {batch[key].device}")
outputs = model(**batch)
  4. Finally, check whether any other device activity is involved, such as training a model on the GPU while running other work on the CPU at the same time; that situation can lead to resource exhaustion.
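
For point 1, here is a minimal sketch of what the evaluation loop from your traceback could look like with the device move in place. It assumes the batches are dicts of tensors and that the model returns a loss; that matches the call sites in the traceback, but not necessarily your actual code:

import torch

def evaluation(model, eval_dataloader):
    # After DeepSpeed initialization each rank is bound to one CUDA device.
    device = torch.device("cuda", torch.cuda.current_device())
    model.eval()
    losses = 0.0
    for step, batch in enumerate(eval_dataloader):
        # Move every tensor in the batch onto the model's device.
        batch = {k: v.to(device) for k, v in batch.items()}
        with torch.no_grad():
            outputs = model(**batch)
        losses += outputs.loss.float()
    losses = losses / (step + 1)
    return torch.exp(losses)  # perplexity, as in the traceback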
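
For point 2, a minimal sketch of a typical single-machine multi-GPU DeepSpeed setup; args, model, and ds_config here are assumptions standing in for your launcher arguments, your model, and your DeepSpeed config:

import torch
import deepspeed

def setup(args, model, ds_config):
    # The deepspeed launcher passes --local_rank; bind this process to its
    # own GPU before doing anything else on CUDA.
    torch.cuda.set_device(args.local_rank)
    deepspeed.init_distributed()
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    return engine, optimizer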
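
For point 3, one more detail that explains the confusing behavior you describe: CUDA executes asynchronously, so the "illegal memory access" reported when you printed weight (and again when you called input.cuda()) is most likely an error raised earlier by some other kernel that only gets reported at the next CUDA call; those lines themselves are probably fine. Making launches synchronous pins the error to the call that actually caused it. A sketch, where dump_devices is a hypothetical helper:

import os
# Must be set before CUDA is initialized (ideally before importing torch):
# kernel launches become synchronous, so an error is reported at the call
# that caused it rather than at some later, unrelated CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def dump_devices(model, batch):
    # tensor.device only reads metadata and never touches GPU memory,
    # unlike printing untyped_storage(), which copies the bytes to host.
    for key, value in batch.items():
        if torch.is_tensor(value):
            print(f"{key}: {value.device}")
    print("first parameter:", next(model.parameters()).device)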

Hope these suggestions help you solve the problem. (AI-generated answer)