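python - PyTorch model returns NaN after the first round
Problem description
This is my first time writing a PyTorch-based CNN. I finally got the code to run far enough to produce output for the first batch of data, but it produces nans on the second batch. For debugging purposes I greatly simplified the model, and it still does not work. The model shown here is just a few fully connected layers with a linear output.
My guess is that the problem is in the backpropagation step, but it is not clear to me where or why.
Here is a heavily simplified version of the model that still produces the error:
Data loader: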
import numpy as np
import torch
import torch.nn as nn
import torch.utils.data as utils

batch_size = 36
device = 'cuda'
# note "rollaxis" to move channel from last to first dimension
# X_train is n input images x 70 width x 70 height x 3 channels
# Y_train is n doubles
torch_train = utils.TensorDataset(torch.from_numpy(np.rollaxis(X_train, 3, 1)).float(), torch.from_numpy(Y_train).float())
train_loader = utils.DataLoader(torch_train, batch_size=batch_size, shuffle=True)
Defining and creating the model:
def MyCNN(**kwargs):
    return MyCNN_model_simple(**kwargs)

# switched from Sequential() style to assist debugging
class MyCNN_model_simple(nn.Module):
    def __init__(self, **kwargs):
        super(MyCNN_model_simple, self).__init__()
        self.fc1 = FullyConnected(3 * 70 * 70, 100)
        self.fc2 = FullyConnected(100, 100)
        self.last = nn.Linear(100, 1)
        # self.net = nn.Sequential(
        #     self.fc1,
        #     self.fc2,
        #     self.last,
        #     nn.Flatten()
        # )

    def forward(self, x):
        print(f"x shape A: {x.shape}")
        x = torch.flatten(x, 1)
        print(f"x shape B: {x.shape}")
        x = self.fc1(x)
        print(f"x shape C: {x.shape}")
        x = self.fc2(x)
        print(f"x shape D: {x.shape}")
        x = self.last(x)
        print(f"x shape E: {x.shape}")
        x = torch.flatten(x)
        print(f"x shape F: {x.shape}")
        return x
        # return self.net(x)

class FullyConnected(nn.Module):
    def __init__(self, in_channels, out_channels, dropout=None):
        super(FullyConnected, self).__init__()
        layers = []
        layers.append(nn.Linear(in_channels, out_channels, bias=True))
        layers.append(nn.ReLU())
        if dropout is not None:
            layers.append(nn.Dropout(p=dropout))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = MyCNN()
# convert to 16-bit half-precision to save memory
model.half()
model.to(torch.device('cuda'))
Running the model:
loss_fn = nn.MSELoss()
dev = torch.device('cuda')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
losses = []
max_batches = 2

def process_batch():
    # images and scores are the globals set by the training loop below
    inputs = images.half().to(dev)
    values = scores.half().to(dev)
    # clear accumulated gradients
    optimizer.zero_grad()
    # make predictions
    outputs = model(inputs)
    # calculate and save the loss
    model_out = torch.flatten(outputs)
    print(f"Outputs: {model_out}")
    loss = loss_fn(model_out.half(), torch.flatten(values))
    losses.append(loss.item())
    # backpropagate the loss
    loss.backward()
    # adjust parameters to computed gradients
    optimizer.step()

model.train()
i = 0
for images, scores in train_loader:
    process_batch()
    i += 1
    if i > max_batches:
        break
Standard output:
x shape A: torch.Size([36, 3, 70, 70])
x shape B: torch.Size([36, 9800])
x shape C: torch.Size([36, 100])
x shape D: torch.Size([36, 100])
x shape E: torch.Size([36, 1])
x shape F: torch.Size([36])
Outputs: tensor([0.0406, 0.0367, 0.0446, 0.0529, 0.0406, 0.0391, 0.0397, 0.0391, 0.0415,
0.0443, 0.0410, 0.0406, 0.0349, 0.0396, 0.0368, 0.0401, 0.0343, 0.0419,
0.0428, 0.0385, 0.0345, 0.0431, 0.0287, 0.0328, 0.0309, 0.0416, 0.0473,
0.0352, 0.0422, 0.0375, 0.0428, 0.0345, 0.0368, 0.0319, 0.0365, 0.0382],
device='cuda:0', dtype=torch.float16, grad_fn=<AsStridedBackward>)
x shape A: torch.Size([36, 3, 70, 70])
x shape B: torch.Size([36, 9800])
x shape C: torch.Size([36, 100])
x shape D: torch.Size([36, 100])
x shape E: torch.Size([36, 1])
x shape F: torch.Size([36])
Outputs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', dtype=torch.float16, grad_fn=<AsStridedBackward>)
x shape A: torch.Size([36, 3, 70, 70])
x shape B: torch.Size([36, 9800])
x shape C: torch.Size([36, 100])
x shape D: torch.Size([36, 100])
x shape E: torch.Size([36, 1])
x shape F: torch.Size([36])
Outputs: tensor([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
device='cuda:0', dtype=torch.float16, grad_fn=<AsStridedBackward>)
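You can see the nans coming out of the model starting with the second batch. Is there something obviously wrong in what I am doing? If anyone has tips on best practices for debugging a running PyTorch module that I could use to track down the problem, that would be very helpful.
Thanks.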
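On the debugging side of the question, one common approach is to check parameters and gradients for non-finite values right after `loss.backward()` and again right after `optimizer.step()`, which narrows down which of the two introduces the nans. The helper below is a minimal sketch; `find_non_finite` and the `toy` model are hypothetical names used only for illustration and are not part of the original code.

```python
import torch
import torch.nn as nn

def find_non_finite(model):
    """Return the names of parameters (and gradients) that contain nan or inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name + ".grad")
    return bad

# Tiny stand-in model, used only to exercise the helper.
toy = nn.Linear(4, 1)
print(find_non_finite(toy))  # []

# Deliberately poison one weight to show what the helper reports.
with torch.no_grad():
    toy.weight[0, 0] = float("nan")
print(find_non_finite(toy))  # ['weight']
```

In the training loop above, calling `find_non_finite(model)` once after `loss.backward()` and once after `optimizer.step()` would show whether the gradients or the parameter update go bad first. `torch.autograd.set_detect_anomaly(True)` can additionally flag nans produced during the backward pass, though it does not cover nans introduced by the optimizer step itself.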
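Solution
Keep the forward and backward passes in half precision, but switch the model to full precision before the optimizer updates the parameters: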
loss.backward()
model.float() # add this here
optimizer.step()
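A plausible reason the update needs full precision (this is an assumption, not something stated above): Adam divides by `sqrt(v) + eps`, and its default `eps` of 1e-8 is smaller than anything float16 can represent, so it rounds to zero. The CPU-only sketch below shows that underflow and the resulting 0 / 0; the names `m`, `v` and `update` are illustrative stand-ins for Adam's internal state, not the optimizer's actual code.

```python
import torch

eps = 1e-8  # default eps of torch.optim.Adam

eps_fp32 = torch.tensor(eps, dtype=torch.float32)
eps_fp16 = torch.tensor(eps, dtype=torch.float16)
print(eps_fp32.item() > 0)  # True
print(eps_fp16.item())      # 0.0

# Illustration only: if the running average of squared gradients has also
# underflowed to zero, the Adam-style ratio m / (sqrt(v) + eps) becomes 0 / 0.
m = torch.zeros(1)
v = torch.zeros(1)
update = m / (v.sqrt() + eps_fp16.float())
print(torch.isnan(update).all().item())  # True
```

Once a single nan lands in the weights, every later forward pass returns nan, which would match the output shown in the question.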
Then switch the model back to half precision before processing each batch:
for images, scores in train_loader:
    model.half() # add this here
    process_batch()