PyTorch MNIST example does not converge

Problem description

I am writing a toy example that performs MNIST classification. Here is the complete code of my example:

import matplotlib
matplotlib.use("Agg")
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader

import torchvision.transforms as transforms
import torchvision.datasets as datasets

import matplotlib.pyplot as plt
import os
from os import system, listdir
from os.path import join, isfile, isdir, dirname

def img_transform(image):
    transform=transforms.Compose([
        # transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
    return transform(image)


def normalize_output(img):
    img = img - img.min()
    img = img / img.max()
    return img

def save_checkpoint(state, filename='checkpoint.pth.tar'):
    torch.save(state, filename)

class Net(nn.Module):
    """docstring for Net"""
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
data_images, data_labels = torch.load("./PATH/MNIST/processed/training.pt")
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)
epochs = 5
batch_size = 30
num_batch = int(data_images.shape[0] / batch_size)
for epoch in range(epochs):
    for batch_idx in range(num_batch):
        data = data_images[ batch_idx*batch_size : (batch_idx+1)*batch_size ].float()
        label = data_labels[ batch_idx*batch_size : (batch_idx+1)*batch_size ]
        data = img_transform(data)
        data = data.unsqueeze_(1)
        pred_score = model(data)
        loss = criterion(pred_score, label)
        loss.backward()
        optimizer.step()
        if batch_idx % 200 == 0:
            print('epoch', epoch, batch_idx, '/', num_batch, 'loss', loss.item())
            _, pred = pred_score.topk(1)
            pred = pred.t().squeeze()
            correct = pred.eq(label)
            num_correct = correct.sum(0).item()
            print('acc=', num_correct/batch_size)

dict_to_save = {
    'epoch': epochs,
    'state_dict': model.state_dict(),
    'optimizer' : optimizer.state_dict(),
    }
ckpt_file = 'a.pth.tar'
save_checkpoint(dict_to_save, ckpt_file)
print('save to ckpt_file', ckpt_file)
exit()

The code can be run with the MNIST dataset saved at ./PATH/MNIST/processed/training.pt.

However, training does not converge: the training accuracy stays below 0.2. What is wrong with my implementation? I have tried different learning rates and batch sizes, but it did not help.

Are there any other problems in my code?

Here are some of the training logs:

epoch 0 0 / 2000 loss 27.2023868560791
acc= 0.1
epoch 0 200 / 2000 loss 2.3346288204193115
acc= 0.13333333333333333
epoch 0 400 / 2000 loss 2.691042900085449
acc= 0.13333333333333333
epoch 0 600 / 2000 loss 2.6452369689941406
acc= 0.06666666666666667
epoch 0 800 / 2000 loss 2.7910964488983154
acc= 0.13333333333333333
epoch 0 1000 / 2000 loss 2.966330051422119
acc= 0.1
epoch 0 1200 / 2000 loss 3.111387014389038
acc= 0.06666666666666667
epoch 0 1400 / 2000 loss 3.1988155841827393
acc= 0.03333333333333333

Tags: deep-learning, neural-network, pytorch, conv-neural-network

Solution


I can see at least four issues that affect the results you are getting:

  1. You need to zero the gradients before each backward pass, e.g.:
optimizer.zero_grad()
loss.backward()
optimizer.step()
  2. You are passing the output of F.log_softmax to nn.CrossEntropyLoss(), but it expects raw logits (it applies log-softmax internally, so you end up applying it twice). Remove this line from forward():
output = F.log_softmax(x, dim=1)
  3. When printing the metrics, you only compute the loss and accuracy of the current batch, so the numbers are noisy rather than representative. To fix it, store all the losses/accuracies and compute the average before printing, e.g.:
# During the loop
loss_value += loss.item()

# When printing:
print(loss_value/number_of_batch_losses_stored)
  4. This is not a big issue, but I would say the learning rate should be smaller, e.g. 1e-3. (A sketch applying all four fixes follows this list.)
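
Putting the four fixes together, here is a minimal sketch of the corrected training loop. It reuses Net, criterion, img_transform, data_images, data_labels, epochs, batch_size, and num_batch from your code, assumes the F.log_softmax line has been removed from forward() as per point 2, and the 200-batch averaging window is just an illustrative choice:

optimizer = optim.Adam(model.parameters(), lr=1e-3)  # fix 4: smaller learning rate

loss_sum, correct_sum, seen = 0.0, 0, 0
for epoch in range(epochs):
    for batch_idx in range(num_batch):
        data = data_images[batch_idx*batch_size : (batch_idx+1)*batch_size].float()
        label = data_labels[batch_idx*batch_size : (batch_idx+1)*batch_size]
        data = img_transform(data).unsqueeze(1)

        optimizer.zero_grad()            # fix 1: clear gradients from the previous step
        logits = model(data)             # fix 2: forward() now returns raw logits
        loss = criterion(logits, label)  # CrossEntropyLoss applies log-softmax itself
        loss.backward()
        optimizer.step()

        # fix 3: accumulate metrics and report averages instead of single-batch values
        loss_sum += loss.item()
        correct_sum += (logits.argmax(dim=1) == label).sum().item()
        seen += label.size(0)
        if (batch_idx + 1) % 200 == 0:
            print('epoch', epoch, batch_idx + 1, '/', num_batch,
                  'avg loss', loss_sum / 200, 'avg acc', correct_sum / seen)
            loss_sum, correct_sum, seen = 0.0, 0, 0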

As a tip for improving the pipeline, it is better to use a DataLoader to load the data; have a look at torch.utils.data to see how to do that. Loading batches the way you currently do is inefficient, since no generator is used. Also, MNIST is already available in torchvision.datasets.MNIST, and loading the data from there will save you some time.
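
For example, a minimal sketch of that setup (the root path './PATH' and the batch size of 30 are carried over from your code; shuffle=True is my own addition, since shuffling between epochs generally helps convergence):

from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL image -> (1, 28, 28) float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)),  # same statistics as in your code
])
train_set = datasets.MNIST(root='./PATH', train=True, download=True,
                           transform=transform)
train_loader = DataLoader(train_set, batch_size=30, shuffle=True)

for data, label in train_loader:  # data: (30, 1, 28, 28), label: (30,)
    ...                           # run the training step shown above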

