I am trying to reproduce the wav2vec 2.0 base model training described in the wav2vec README (https://github.com/facebookresearch/fairseq/tree/main/examples/wav2vec#wav2vec-20). I run the commands exactly as given there, but when I try to train on my own dataset, it only gets this far:
2023-06-30 08:41:55 | INFO | fairseq_cli.train | max tokens per GPU = 1000000 and batch size per GPU = None
2023-06-30 08:41:56 | INFO | fairseq.trainer | loaded checkpoint /root/autodl-tmp/wav2vec_small.pt (epoch 575 @ 0 updates)
2023-06-30 08:41:57 | INFO | fairseq.trainer | loading train data for epoch 1
2023-06-30 08:41:57 | INFO | fairseq.data.audio.raw_audio_dataset | loaded 329108, skipped 26424 samples
2023-06-30 08:41:57 | INFO | fairseq.trainer | begin training epoch 1
After that, this error appears: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.HalfTensor [8, 768, 257]], which is output 0 of AsStridedBackward0, is at version 2; expected version 1 instead.
The full traceback is shown below:
/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py:173: UserWarning: Error detected in ConvolutionBackward0. Traceback of forward call that caused the error:
File "/root/miniconda3/bin/fairseq-hydra-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/hydra_train.py", line 66, in cli_main
hydra_main()
File "/root/miniconda3/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
run_and_report(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
lambda: hydra.run(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/core/utils.py", line 129, in run_job
ret.return_value = task_function(task_cfg)
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/hydra_train.py", line 38, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "/root/autodl-tmp/libs/fairseq/fairseq/distributed_utils.py", line 336, in call_main
main(cfg, **kwargs)
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/train.py", line 138, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/root/miniconda3/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/train.py", line 235, in train
log_output = trainer.train_step(samples)
File "/root/miniconda3/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/root/autodl-tmp/libs/fairseq/fairseq/trainer.py", line 530, in train_step
loss, sample_size_i, logging_output = self.task.train_step(
File "/root/autodl-tmp/libs/fairseq/fairseq/tasks/fairseq_task.py", line 428, in train_step
loss, sample_size, logging_output = criterion(model, sample)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/root/autodl-tmp/libs/fairseq/fairseq/criterions/wav2vec_criterion.py", line 52, in forward
net_output = model(**sample["net_input"])
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/root/autodl-tmp/libs/fairseq/fairseq/models/wav2vec/wav2vec2.py", line 502, in forward
x = self.encoder(x, padding_mask=padding_mask)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/root/autodl-tmp/libs/fairseq/fairseq/models/wav2vec/wav2vec2.py", line 730, in forward
x = self.extract_features(x, padding_mask)
File "/root/autodl-tmp/libs/fairseq/fairseq/models/wav2vec/wav2vec2.py", line 742, in extract_features
x_conv = self.pos_conv(x.transpose(1, 2))
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/container.py", line 141, in forward
input = module(input)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1128, in _call_impl
result = forward_call(*input, **kwargs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 302, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/root/miniconda3/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 298, in _conv_forward
return F.conv1d(input, weight, bias, self.stride,
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
File "/root/miniconda3/bin/fairseq-hydra-train", line 33, in <module>
sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-hydra-train')())
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/hydra_train.py", line 66, in cli_main
hydra_main()
File "/root/miniconda3/lib/python3.8/site-packages/hydra/main.py", line 32, in decorated_main
_run_hydra(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 346, in _run_hydra
run_and_report(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
raise ex
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
return func()
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/utils.py", line 347, in <lambda>
lambda: hydra.run(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 107, in run
return run_job(
File "/root/miniconda3/lib/python3.8/site-packages/hydra/core/utils.py", line 129, in run_job
ret.return_value = task_function(task_cfg)
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/hydra_train.py", line 38, in hydra_main
distributed_utils.call_main(cfg, pre_main)
File "/root/autodl-tmp/libs/fairseq/fairseq/distributed_utils.py", line 336, in call_main
main(cfg, **kwargs)
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/train.py", line 138, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/root/miniconda3/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/root/autodl-tmp/libs/fairseq/fairseq_cli/train.py", line 235, in train
log_output = trainer.train_step(samples)
File "/root/miniconda3/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "/root/autodl-tmp/libs/fairseq/fairseq/trainer.py", line 562, in train_step
raise e
File "/root/autodl-tmp/libs/fairseq/fairseq/trainer.py", line 530, in train_step
loss, sample_size_i, logging_output = self.task.train_step(
File "/root/autodl-tmp/libs/fairseq/fairseq/tasks/fairseq_task.py", line 432, in train_step
optimizer.backward(loss)
File "/root/autodl-tmp/libs/fairseq/fairseq/optim/fp16_optimizer.py", line 104, in backward
loss.backward()
File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Why does this problem occur, and how can I fix it?
My environment is as follows:
PyTorch 1.11.0
Python 3.8(ubuntu20.04)
Cuda 11.3
GPU: RTX 4090(24GB) * 1
CPU: 25 vCPU AMD EPYC 7T83 64-Core Processor
fairseq 1.0.0a0+c8a0659
Here are a few common causes:
1) Find the in-place operations in your network and change inplace=True to inplace=False, e.g. torch.nn.ReLU(inplace=False).
2) Rewrite in-place arithmetic such as "a += b" as "c = a + b"; likewise change a *= b to a = a * b and a /= b to a = a / b. Search carefully; see the sketch after this list.
3) Make sure optimizer.step() is called after loss.backward() in the training code.
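A minimal sketch (toy code, not fairseq's) of how an in-place op produces exactly this RuntimeError, and the out-of-place fix:

import torch

# sigmoid saves its output h for the backward pass; writing to h in
# place bumps its version counter, so backward() later fails
w = torch.randn(3, requires_grad=True)
h = w.sigmoid()
h += 1  # in-place write on a tensor autograd still needs
try:
    h.sum().backward()
except RuntimeError as e:
    print(e)  # "... has been modified by an inplace operation ..."

# fix: the out-of-place form creates a new tensor, leaving the graph intact
h = w.sigmoid()
h = h + 1            # or h = h.clone() first if a separate copy is needed
h.sum().backward()   # succeeds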
I can't rewrite the code for you, but given this full error message, here are a few things to try:
1. First, make sure your PyTorch version is up to date. You can find the latest version on the official PyTorch website and install it per the official docs.
2. If you are using the GPU build of PyTorch, make sure your CUDA version is compatible with it. The PyTorch website publishes a CUDA/PyTorch compatibility matrix; check that your CUDA version meets the requirements.
3. If you are using a conda environment, try reinstalling PyTorch in a clean conda environment.
4. If none of the above helps, the problem may be corrupted PyTorch library files; try uninstalling and reinstalling PyTorch.
Other possibilities worth checking: samples in your dataset are in the wrong format, which makes the backward pass fail; the PyTorch and fairseq versions are mismatched and incompatible; or a GPU problem is preventing the computation from finishing. You can verify the basics quickly, as shown below.
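A quick sanity check of the environment (standard PyTorch API, nothing project-specific):

import torch

print(torch.__version__)              # installed PyTorch version, e.g. 1.11.0
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.is_available())      # should be True on your RTX 4090
print(torch.cuda.get_device_name(0))  # name of the detected GPU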
I once hit a RuntimeError while training Word2Vec, and handled it like this:
When defining the model,
model = word2vec.Word2Vec(sentences, min_count=5)
there is a min_count parameter with a default value of 5; during training, Word2Vec ignores any word whose frequency is below this value. The error occurred because every word in the list I passed in had a frequency below that threshold.
Solution: lower the min_count setting, as in the sketch below.
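A minimal gensim sketch (the toy corpus here is made-up placeholder data):

from gensim.models import word2vec

# Hypothetical tiny corpus: every word appears fewer than 5 times, so the
# default min_count=5 would filter out the entire vocabulary.
sentences = [["hello", "world"], ["hello", "wav2vec"]]
model = word2vec.Word2Vec(sentences, min_count=1)  # lowered threshold
print(model.wv["hello"].shape)  # the word now has a trained vector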
The error "one of the variables needed for gradient computation has been modified by an inplace operation" roughly means that while computing gradients, autograd detected that some variable had been modified by an in-place operation. Following the hint in the message, call torch.autograd.set_detect_anomaly(True) and run again to get more detailed error output.
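A short sketch of enabling anomaly detection before the training loop (plain PyTorch API):

import torch

# With anomaly detection on, the RuntimeError raised during backward() is
# accompanied by the forward-pass traceback of the op that created the
# offending tensor, which pinpoints the in-place modification.
torch.autograd.set_detect_anomaly(True)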
The fix is:
For the .add_() method, which modifies the tensor in place, change x.add_(y) to x = x + y. If you need a separate copy, use the .clone() method.
In-place operations can also come from += or *=; for example, x += y needs to become x = x + y.
Also check whether your versions match.
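For the version check, something like this prints both at once (fairseq exposes fairseq.__version__ in recent releases; if yours doesn't, pip show fairseq works too):

import torch
import fairseq

print("torch  ", torch.__version__)    # e.g. 1.11.0
print("fairseq", fairseq.__version__)  # e.g. 0.10.2 or 1.0.0a0+c8a0659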
Citing ChatGPT's answer:
This problem may be caused by an incorrect automatic mixed precision (AMP) setup. AMP is a technique that speeds up training by running parts of the computation at lower precision. However, some operations do not support in-place modification, which can lead to version mismatches in the computation graph.
To resolve this, you can try the following:
Disable automatic mixed precision: find the AMP-related settings in the training setup and turn them off. You can set the fp16 option to false, or comment out the AMP-related configuration; see the command sketch after this list.
Train in full precision: change the tensor type used during training from half precision to full precision, replacing torch.cuda.HalfTensor with torch.cuda.FloatTensor wherever tensor types are specified.
These approaches run your model in full precision, which may slow training somewhat. If you want to keep half-precision training, you may need to review the AMP settings and make sure you only use functions and modules that support half-precision operations. The fairseq team may also have fixed this in the code base already; try the latest version of fairseq to see whether the error is gone.
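For example, since the wav2vec2_base_librispeech pretraining config enables fp16 under common.fp16, a Hydra override on the README command should disable it (the paths below are placeholders, and the config layout may differ across fairseq versions):

fairseq-hydra-train \
    task.data=/path/to/manifest \
    common.fp16=false \
    --config-dir /path/to/fairseq/examples/wav2vec/config/pretraining \
    --config-name wav2vec2_base_librispeech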
Separately, this error is usually caused by in-place modification of a variable, and the traceback shows it surfacing in a convolution operation. This may be because the PyTorch version you are using is incompatible with what fairseq expects.
From the information you provided, you are running PyTorch 1.11.0 with fairseq 1.0.0a0+c8a0659, while the README you linked was written against fairseq v0.10.2.
I suggest switching to a fairseq version that matches the README and your PyTorch version. Try installing it with:
pip install fairseq==0.10.2
If you already have another version of fairseq installed, uninstall it first:
pip uninstall fairseq
After installing it, try training the model again and see whether the same error still occurs.