Is binary cross-entropy an additive function?

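Question

I am trying to train a machine learning model whose loss function is binary cross-entropy. Because of GPU memory limits I can only use a batch size of 4, and my loss curve has a lot of spikes. So I am thinking of backpropagating only after some predefined effective batch size (> 4): for example, run 10 iterations with batch size 4, store the losses, then after the 10th iteration add the losses together and backpropagate. Would that be similar to using a batch size of 40?

TL;DR

Is f(a+b) = f(a) + f(b) true for binary cross-entropy?

Tags: machine-learning, math, pytorch, backpropagation, batchsize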

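Answer


f(a+b) = f(a) + f(b) doesn't seem to be what you're after. That would mean BCELoss is additive, which it clearly isn't. I think what you really care about is whether, for some index i,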

# false
f(x, y) == f(x[:i], y[:i]) + f(x[i:], y[i:])

holds.

The short answer is no, because you're missing some scale factors. What you probably want is the following identity

# true
f(x, y) == (i / b) * f(x[:i], y[:i]) + (1.0 - i / b) * f(x[i:], y[i:])

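where b is the total batch size.

This identity is the motivation for the gradient accumulation method (see below). It also holds for any objective that returns the mean loss over the batch elements, not just BCE, because a mean over b elements is the size-weighted average of the means over the two parts.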
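
The identity is easy to check numerically. The sketch below is only an illustration: mean_bce is a hand-written stand-in for a mean-reduced BCE (it is not nn.BCELoss itself), and the batch size, split point and random data are arbitrary assumptions.

```python
import math
import random

def mean_bce(probs, targets):
    """Mean binary cross-entropy over paired probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(probs)

random.seed(0)
b = 10   # total batch size (arbitrary)
i = 3    # split point, deliberately uneven
x = [random.uniform(0.05, 0.95) for _ in range(b)]    # predicted probabilities
y = [float(random.randint(0, 1)) for _ in range(b)]   # binary targets

full = mean_bce(x, y)
plain_sum = mean_bce(x[:i], y[:i]) + mean_bce(x[i:], y[i:])
weighted = (i / b) * mean_bce(x[:i], y[:i]) + (1.0 - i / b) * mean_bce(x[i:], y[i:])

print("weighted sum matches full batch:", math.isclose(full, weighted))
print("plain sum matches full batch:", math.isclose(full, plain_sum))
```

With equally sized sub-batches both weights reduce to 1 / (number of sub-batches), which is the scaling used in the examples further down.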


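Caveat/pitfall: keep in mind that batch norm will not behave exactly the same with this approach, because during the forward pass it normalizes with, and updates its running statistics from, the statistics of whichever (sub-)batch it is given.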
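
To make the batch norm caveat concrete, here is a small sketch (the layer size and random input are assumptions chosen for illustration). It feeds the same 8 samples through a BatchNorm1d layer in training mode, once as a single batch and once as two sub-batches of 4, and measures how far apart the outputs are.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3)   # one batch of 8 samples with 3 features

bn = nn.BatchNorm1d(3)
bn.train()              # training mode: normalize with the current batch's statistics

with torch.no_grad():
    full_out = bn(x)                                 # statistics over all 8 samples
    split_out = torch.cat([bn(x[:4]), bn(x[4:])])    # statistics per sub-batch of 4

max_diff = (full_out - split_out).abs().max().item()
print("max abs difference:", max_diff)
```

The difference is nonzero because each sub-batch is normalized with its own mean and variance, so a model containing batch norm will only approximately reproduce the large-batch result.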


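In practice we can do better in terms of memory than computing the loss as a sum and then backpropagating. Instead, we can compute the gradient of each component of the equivalent sum separately and let the gradients accumulate. To explain this better, I'll give some examples of equivalent operations.

Consider the following model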

import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        num_outputs = 5
        # assume input shape is 10x10
        self.conv_layer = nn.Conv2d(3, 10, 3, 1, 1)
        self.fc_layer = nn.Linear(10*5*5, num_outputs)

    def forward(self, x):
        x = self.conv_layer(x)
        x = F.max_pool2d(x, kernel_size=2, stride=2)
        x = F.relu(x)
        x = self.fc_layer(x.flatten(start_dim=1))
        x = torch.sigmoid(x)   # or omit this and use BCEWithLogitsLoss instead of BCELoss
        return x

# to ensure same results for this example
torch.manual_seed(0)
model = MyModel()
# the examples will work as long as the objective averages across batch elements
objective = nn.BCELoss()
# doesn't matter what type of optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
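
The comment in forward mentions BCEWithLogitsLoss as an alternative. As a quick sanity check (a sketch on made-up logits and targets, not part of the original example), the two formulations give the same value, and both average over elements by default, so the accumulation scheme below applies to either.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 5)                       # raw scores, before any sigmoid
targets = torch.randint(0, 2, (4, 5)).float()    # binary targets, same shape

# sigmoid applied explicitly, then BCELoss
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
# sigmoid folded into the loss
loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)

print("same loss:", torch.allclose(loss_from_probs, loss_from_logits, atol=1e-6))
```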

Suppose our data and targets for a single batch are

torch.manual_seed(1)    # to ensure same results for this example
batch_size = 32
input_data = torch.randn((batch_size, 3, 10, 10))
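# binary targets: randint's upper bound is exclusive, and the shape must match the model's 5 outputs
targets = torch.randint(0, 2, (batch_size, 5)).float()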

Entire batch

The body of a training loop over the entire batch might look like this

# entire batch
output = model(input_data)
loss = objective(output, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
loss_value = loss.item()

print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))

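Weighted sum of sub-batch losses

We can compute the same thing as a sum of several loss terms, one per sub-batch. (Each variant takes an optimizer step, so re-run the model setup above before each one if you want to compare the printed values.)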

# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)

# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size

loss = 0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # add loss component for sub_batch
    loss = loss + objective(sub_output, sub_targets) / num_sub_batches
optimizer.zero_grad()
loss.backward()
optimizer.step()

loss_value = loss.item()

print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))

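Gradient accumulation

The problem with the previous approach is that, in order to backpropagate, PyTorch has to keep the intermediate layer results for every sub-batch in memory. That ends up requiring a relatively large amount of memory, and you may still run into memory consumption issues.

To alleviate this, we can perform gradient accumulation instead of computing a single loss and doing one backward pass. This gives a result equivalent to the previous version. The difference is that we instead run a backward pass on each component of the loss and only step the optimizer once all of them have been backpropagated. This way the computation graph is freed after each sub-batch, which helps with memory usage. Note that this works because .backward() actually accumulates (adds) the newly computed gradients into the existing .grad member of each model parameter. This is why optimizer.zero_grad() must be called only once, before the loop, and not inside the loop or between the loop and optimizer.step().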

# This is simpler if the sub-batch size is a factor of batch_size
sub_batch_size = 4
assert (batch_size % sub_batch_size == 0)

# for this to work properly the batch_size must be divisible by sub_batch_size
num_sub_batches = batch_size // sub_batch_size

# Important! zero the gradients before the loop
optimizer.zero_grad()
loss_value = 0.0
for sub_batch_idx in range(num_sub_batches):
    start_idx = sub_batch_size * sub_batch_idx
    end_idx = start_idx + sub_batch_size
    sub_input = input_data[start_idx:end_idx]
    sub_targets = targets[start_idx:end_idx]
    sub_output = model(sub_input)
    # compute loss component for sub_batch
    sub_loss = objective(sub_output, sub_targets) / num_sub_batches
    # accumulate gradients
    sub_loss.backward()
    loss_value += sub_loss.item()
optimizer.step()

print("Loss value: ", loss_value)
print("Model checksum: ", sum([p.sum().item() for p in model.parameters()]))
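
As a cross-check on the question's numbers (10 sub-batches of 4 versus one batch of 40), the following sketch compares the accumulated gradients with the full-batch gradients directly. The small nn.Sequential model, the feature sizes and the random data are assumptions made to keep the example short. The model deliberately has no batch norm or dropout, so the two forward passes are comparable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Small stand-in model (an assumption for brevity), without batch norm or dropout
model = nn.Sequential(nn.Linear(6, 8), nn.ReLU(), nn.Linear(8, 3))
objective = nn.BCEWithLogitsLoss()   # mean reduction by default

batch_size, sub_batch_size = 40, 4
num_sub_batches = batch_size // sub_batch_size
inputs = torch.randn(batch_size, 6)
targets = torch.randint(0, 2, (batch_size, 3)).float()

# Reference: gradients from a single backward pass over the full batch
model.zero_grad()
objective(model(inputs), targets).backward()
full_grads = [p.grad.clone() for p in model.parameters()]

# Gradient accumulation: one backward pass per sub-batch, no optimizer step in between
model.zero_grad()
for sub_in, sub_tgt in zip(inputs.split(sub_batch_size), targets.split(sub_batch_size)):
    (objective(model(sub_in), sub_tgt) / num_sub_batches).backward()
accum_grads = [p.grad.clone() for p in model.parameters()]

grads_match = all(torch.allclose(g_full, g_acc, atol=1e-6)
                  for g_full, g_acc in zip(full_grads, accum_grads))
print("gradients match:", grads_match)
```

Because the gradients agree up to floating-point error, a single optimizer.step() after the loop should update the parameters the same way a batch of 40 would, while only one sub-batch's computation graph is alive at a time.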
