Validation and training loss per batch and epoch

Question

I am using Pytorch to run some deep learning models. I am currently keeping track of training and validation loss per epoch, which is pretty standard. However, what is the best way of going about keeping track of training and validation loss per batch/iteration?

For the training loss, I could just keep a list of the loss after each training batch. But validation loss is calculated after a whole epoch, so I’m not sure how to track validation loss per batch. The only thing I can think of is to run the whole validation step after each training batch and keep track of those losses, but that seems like overkill and a lot of computation.

For example, the training is like this:

for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # accumulate statistics
        running_loss += loss.item()

And for validation loss:

correct = 0       # running count of correct predictions
total = 0         # running count of evaluated samples
loss_test = 0.0   # accumulated validation loss

net.eval()
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
        # validation loss; `error` is the loss criterion
        batch_loss = error(outputs.float(), labels.long()).item()
        loss_test += batch_loss
    loss_test /= len(testloader)

The validation/test part is done once per epoch. I’m looking for a way to get the validation loss per batch, which is the point above.

Any tips?

Tags: machine-learning, deep-learning, pytorch

Answer


An epoch is one pass of the model over the entire training set, which is generally divided into batches and, moreover, usually shuffled. The validation set, on the other hand, is used to tune the hyper-parameters of your training and to find out how your model behaves on new data. In that respect, to me, evaluating at epoch=1/2 doesn't make much sense. The question is: whatever the evaluation at epoch=1/2 shows, what can you do about it? Since you don't know which data the model went through in the first half, there is no way to take advantage of "the first half was better". And remember that your data will likely be shuffled into batches.

Therefore, I would stick with the classic approach: train on the entire set and only then evaluate on another set. In some cases, because of the computation time, you won't even allow yourself to evaluate once per epoch; instead you would evaluate every n epochs. But then again, that will depend on your dataset size, the sampling from that dataset, the batch size, and the computation cost.

For the training loss, you can keep track of its value per update step as well as per epoch. That will give you much more insight into whether your model is learning, independently of the validation phase.
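For example, here is a minimal sketch of tracking both, assuming net, criterion, optimizer and trainloader are defined as in the question (the two list names are illustrative):

step_losses = []   # one entry per update step (batch)
epoch_losses = []  # one averaged entry per epoch

for epoch in range(2):
    running_loss = 0.0
    for inputs, labels in trainloader:
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        step_losses.append(loss.item())  # per-update-step training loss
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(trainloader))  # per-epoch average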


EDIT - As an alternative to having to run the entire evaluation set on each training batch, you could do the following: shuffle your validation set and use the same batch size as for the training set. Then:

  • len(trainset)//batch_size is the number of updates per epoch
  • len(validset)//batch_size is the number of single-batch evaluations allowed per epoch
  • every len(trainset)//len(validset) training updates, you can evaluate on 1 validation batch

This lets you get feedback len(validset)//batch_size times per epoch, in other words one full pass over the validation set, spread evenly across the epoch.

If you set your train/valid ratio to 0.1, then len(validset) = 0.1*len(trainset), i.e. you run one partial (single-batch) evaluation every 10 training updates.
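Putting this together, here is a minimal sketch of the interleaved scheme, assuming net, criterion, optimizer and trainloader as in the question; validloader (built with shuffle=True and the same batch_size as trainloader), eval_interval and the two loss lists are illustrative names:

import torch

# number of training updates between single-batch evaluations
eval_interval = len(trainloader.dataset) // len(validloader.dataset)
valid_iter = iter(validloader)
train_batch_losses, valid_batch_losses = [], []

for epoch in range(2):
    for i, (inputs, labels) in enumerate(trainloader):
        net.train()
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_batch_losses.append(loss.item())  # per-batch training loss

        # every eval_interval updates, evaluate on a single validation batch
        if (i + 1) % eval_interval == 0:
            net.eval()
            try:
                v_inputs, v_labels = next(valid_iter)
            except StopIteration:
                # validation set exhausted: restart (and reshuffle) the loader
                valid_iter = iter(validloader)
                v_inputs, v_labels = next(valid_iter)
            with torch.no_grad():
                v_loss = criterion(net(v_inputs), v_labels)
            valid_batch_losses.append(v_loss.item())  # per-batch validation loss

Over one epoch this consumes the validation set roughly once, so averaging valid_batch_losses at the end of an epoch recovers the usual per-epoch validation loss while still giving per-batch feedback during training.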

