PyTorch ResNet cats-vs-dogs training stalls and does not converge

Symptoms and background

I ran the code below on the cats-vs-dogs dataset. Why does it stall partway through — it only reaches the second epoch — and why does it run so slowly?

Relevant code (no screenshots, please)
# -*- coding: utf-8 -*-
"""
Created on Fri Jul 22 10:26:33 2022

# 11:34第一代
# 11.48第二代
# 12:14 跑到一半不跑了
@author: 19544
"""

import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

EPOCH=5
BATCH_SIZE=40
LR=0.01

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225])   

train_dataset = datasets.ImageFolder(
        'D:\\项目实验文件夹\\猫狗大战数据集\\dogcat_2',
        transforms.Compose([
                transforms.RandomResizedCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                normalize,
                ]))

train_loader = Data.DataLoader(
        train_dataset,
        batch_size=BATCH_SIZE,
        shuffle=True)

test_loader = Data.DataLoader(
        datasets.ImageFolder(
                'D:\\项目实验文件夹\\猫狗大战数据集\\dogcat_2', 
                transforms.Compose([
                        transforms.Resize(256),
                        transforms.CenterCrop(224),
                        transforms.ToTensor(),
                        normalize,
                        ])),
        batch_size=BATCH_SIZE, shuffle=False,)

model = models.resnet18(pretrained=True)
model.fc = torch.nn.Linear(in_features=512, out_features=2, bias=True)  # cats vs dogs: 2 classes, not 5

fc_params = list(map(id, model.fc.parameters()))  # ids of the new fc layer's parameters
base_params = filter(lambda p: id(p) not in fc_params, model.parameters())  # every parameter except the fc layer's
optimizer = torch.optim.SGD([{'params': base_params},
                             {'params': model.fc.parameters(), 'lr': LR * 100}],
                            lr=LR)
loss_func=nn.CrossEntropyLoss()

class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)
            
def accuracy(output, target, topk=(1,)):
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)  # reshape: the slice may be non-contiguous, so view() can fail
            res.append(correct_k.mul_(100.0 / batch_size))
        return res           
            
train_losses = AverageMeter('TrainLoss', ':.4e')
train_top1 = AverageMeter('TrainAccuracy', ':6.2f')
test_losses = AverageMeter('TestLoss', ':.4e')
test_top1 = AverageMeter('TestAccuracy', ':6.2f')

for epoch in range(EPOCH):

    # reset the meters so each epoch's statistics are independent
    train_losses.reset()
    train_top1.reset()
    test_losses.reset()
    test_top1.reset()

    model.train()
    for i,(images,target) in enumerate(train_loader):
        output=model(images)
        loss= loss_func(output,target)
        
        acc1, = accuracy(output, target, topk=(1,))
        train_losses.update(loss.item(), images.size(0))
        train_top1.update(acc1[0], images.size(0))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        print('Epoch[{}/{}],TrainLoss:{}, TrainAccuracy:{}'.format(epoch,EPOCH,train_losses.val, train_top1.val))
           
    model.eval()
    with torch.no_grad():
        for i,(images,target) in enumerate(test_loader):
            output=model(images)
            loss= loss_func(output,target)
            
            acc1, = accuracy(output, target, topk=(1,))
            test_losses.update(loss.item(), images.size(0))
            test_top1.update(acc1[0], images.size(0))
            
    print('TestLoss:{}, TestAccuracy:{}'.format(test_losses.avg, test_top1.avg))


Output and error messages

Epoch[2/5],TrainLoss:0.7035315036773682, TrainAccuracy:47.5
Epoch[2/5],TrainLoss:0.7905141711235046, TrainAccuracy:47.5
Epoch[2/5],TrainLoss:0.7110738158226013, TrainAccuracy:47.5
Epoch[2/5],TrainLoss:0.709513783454895, TrainAccuracy:47.5
Epoch[2/5],TrainLoss:0.6796354055404663, TrainAccuracy:60.0
Epoch[2/5],TrainLoss:0.6862636804580688, TrainAccuracy:55.0

My approach and what I have tried

I tried changing the number of epochs, but training still takes too long and the accuracy does not converge.

The result I want

I would like help getting the training to run to completion and converge.

The slowness is because CUDA acceleration is not being used

If you installed the GPU build of PyTorch, you can train on CUDA directly. The basic usage is simple: roughly model.cuda(), input_tensor.cuda(), out = model(input_tensor) — in other words, append .cuda() to the model and to every tensor you need to compute with. You probably will not need the more advanced features such as multi-GPU or distributed training. The prerequisite is a correctly installed PyTorch, CUDA, and cuDNN; see the official documentation for more CUDA-related operations.
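A minimal sketch of the device handling described above, using the `torch.device` idiom so the same code falls back to the CPU when no GPU is present (the tiny `nn.Linear` model here is a stand-in; in the question's code it would be the resnet18 and the batches from `train_loader`):

```python
import torch
import torch.nn as nn

# pick the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# .to(device) is equivalent to .cuda() on a GPU machine
model = nn.Linear(10, 2).to(device)
loss_func = nn.CrossEntropyLoss()

# inside the training loop, every batch must be moved to the same device as the model
images = torch.randn(4, 10)
target = torch.tensor([0, 1, 0, 1])
images, target = images.to(device), target.to(device)

output = model(images)              # runs on the GPU when device is 'cuda'
loss = loss_func(output, target)
print(device, output.shape, loss.item())
```

Moving either the model or the data alone is not enough — PyTorch raises a device-mismatch error if the model's weights and the input batch live on different devices, which is why both `model` and each `(images, target)` pair need the `.to(device)` (or `.cuda()`) call.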

The poster's code does not use CUDA acceleration. In PyTorch, both the model and the data must be moved to the GPU — model.cuda() and data.cuda() — before training can be accelerated, which is why training is slow here. The code actually does converge; deep learning models simply converge slowly and cannot reach high accuracy within the first few epochs. In typical deep learning training, loss and accuracy oscillate as they trend downward. If the poster lacks a GPU environment, consider Baidu's AI Studio: regular users get 48 hours of GPU time per week on a V100 32 GB card, though only the PaddlePaddle framework can be used (its modeling and training APIs are essentially the same as PyTorch's; see https://hpg123.blog.csdn.net/article/details/122681281). AI Studio: https://aistudio.baidu.com/aistudio/personalcenter/thirdview/999543