Getting nan loss in Adam model using PyTorch

Problem description

I'm new to training neural networks, so forgive me if this is a very silly question or breaks any unstated Stack Overflow rules. I recently started working on the Titanic dataset and have cleaned the data. I have a feature tensor made by concatenating the normalized continuous data with one-hot tensors of the categorical data. I pass this data into a simple linear model and get a nan loss in every epoch.
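(For context, a feature tensor like the one described above is typically assembled along these lines. This is a minimal sketch assuming hypothetical Titanic column names such as Age, Fare, Sex, Embarked and Pclass; it is not the asker's actual preprocessing and does not necessarily produce the 18 columns the model below expects. The asker's full training code follows.)

import torch
import pandas as pd

df = pd.read_csv('train.csv')  # hypothetical path

# Normalize the continuous columns.
cont = torch.tensor(df[['Age', 'Fare']].fillna(0).values, dtype=torch.float)
cont = (cont - cont.mean(dim=0)) / cont.std(dim=0)

# One-hot encode the categorical columns.
cat = torch.tensor(
    pd.get_dummies(df[['Sex', 'Embarked', 'Pclass']].astype(str)).values,
    dtype=torch.float)

# Concatenate into a single feature tensor.
features = torch.cat([cont, cat], dim=1)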

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm
import pickle
import pathlib

path = pathlib.Path('./drive/My Drive/Kaggle/Titanic')

with open(path/'feature_tensor.pickle', 'rb') as f:
    features = pickle.load(f)

with open(path/'label_tensor.pickle', 'rb') as f:
    labels = pickle.load(f)

features = features.float()
labels = labels.float()

import math
valid_size = -1 * math.floor(0.2*len(features))

train_features = features[:valid_size]
valid_features = features[valid_size:]

train_labels = labels[:valid_size]
valid_labels = labels[valid_size:]

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.h_l1 = nn.Linear(18, 64)
        self.h_l2 = nn.Linear(64, 32)
        self.o_l = nn.Linear(32, 2)

    def forward(self, x):
        x = F.relu(self.h_l1(x))
        x = F.relu(self.h_l2(x))
        return self.o_l(x)

model = Model()
model.to('cuda')

optimizer = optim.Adam(model.parameters())
loss_fn = nn.MSELoss()

EPOCHS = 5
BATCH_SIZE = 20

for EPOCH in range(0, EPOCHS):
    for i in tqdm(range(0, len(features), BATCH_SIZE)):
        train_feature_batch = train_features[i:i+BATCH_SIZE,:].to('cuda')
        train_label_batch = train_labels[i:i+BATCH_SIZE,:].to('cuda')
        valid_feature_batch = valid_features[i:i+BATCH_SIZE,:].to('cuda')
        valid_label_batch = valid_labels[i:i+BATCH_SIZE,:].to('cuda')
        train_loss = loss_fn(model(train_feature_batch), train_label_batch)
        with torch.no_grad():
            valid_loss = loss_fn(model(valid_feature_batch), valid_label_batch)
        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()
    print(f"Epoch : {EPOCH}\tTrain Loss : {train_loss}\tValid_loss : {valid_loss}\n")

I get the following output:

100%|██████████| 45/45 [00:00<00:00, 511.50it/s]
100%|██████████| 45/45 [00:00<00:00, 604.10it/s]
100%|██████████| 45/45 [00:00<00:00, 586.21it/s]
  0%|          | 0/45 [00:00<?, ?it/s]Epoch : 0 Train Loss : nan    Valid_loss : nan

Epoch : 1   Train Loss : nan    Valid_loss : nan

Epoch : 2   Train Loss : nan    Valid_loss : nan

100%|██████████| 45/45 [00:00<00:00, 555.55it/s]
100%|██████████| 45/45 [00:00<00:00, 607.65it/s]Epoch : 3   Train Loss : nan    Valid_loss : nan

Epoch : 4   Train Loss : nan    Valid_loss : nan

Yes, the output really is scattered like this. Please help.

Tags: deep-learning, pytorch, adam

Solution
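No accepted answer is reproduced here, but a likely culprit is visible in the posted code: the batch loop runs over range(0, len(features), BATCH_SIZE) while indexing into train_features and valid_features, which hold only about 80% and 20% of the rows respectively. Once i passes the end of a split, the slice is empty, and nn.MSELoss with mean reduction over an empty tensor evaluates to nan (0/0), so the loss printed after the last batch of every epoch is nan. A minimal sketch of a corrected loop, keeping the asker's model, optimizer, loss function, and variable names as defined in the question:

# Assumes model, optimizer, loss_fn, train_features, train_labels,
# valid_features and valid_labels are defined as in the question.
import torch
from tqdm import tqdm

EPOCHS = 5
BATCH_SIZE = 20

for epoch in range(EPOCHS):
    # Iterate over the *training* split only, so no slice runs past its end.
    for i in tqdm(range(0, len(train_features), BATCH_SIZE)):
        feature_batch = train_features[i:i+BATCH_SIZE].to('cuda')
        label_batch = train_labels[i:i+BATCH_SIZE].to('cuda')

        optimizer.zero_grad()
        train_loss = loss_fn(model(feature_batch), label_batch)
        train_loss.backward()
        optimizer.step()

    # Validate once per epoch on the whole validation split.
    with torch.no_grad():
        valid_loss = loss_fn(model(valid_features.to('cuda')),
                             valid_labels.to('cuda'))

    print(f"Epoch : {epoch}\tTrain Loss : {train_loss.item()}\t"
          f"Valid_loss : {valid_loss.item()}")

If the loss is still nan on the very first in-range batch, check the pickled tensors themselves, e.g. torch.isnan(features).any() and torch.isnan(labels).any(), since a nan introduced during data cleaning would propagate through MSELoss regardless of how the batches are sliced.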

