Google Colab RAM problem when training a semi-supervised CNN model

Problem description

I am trying to train a binary classifier with transfer learning on EfficientNet. Since I have a lot of unlabeled data, I use a semi-supervised approach to generate a batch of "pseudo-labeled" data before the model goes through each epoch.
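For context, a minimal sketch of the kind of setup I mean (the torchvision backbone and the 2-class head below are illustrative placeholders, not my exact code):

# Illustrative only: pretrained EfficientNet backbone with the final classifier
# replaced by a 2-class linear head for the binary task.
import torch.nn as nn
from torchvision import models

model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)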

Because Colab has a RAM limit, I delete several of the large variables (numpy arrays, datasets, data loaders, ...) in every loop, but the RAM usage still grows every loop, as shown in the figure below.
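For reference, this is roughly how the per-epoch RAM number could be logged (a minimal sketch assuming psutil, which Colab ships with):

# Minimal sketch: log the resident memory of the current process so the growth
# can be attributed to a specific epoch or step.
import os
import psutil

def log_ram(tag):
    rss = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
    print(f"[{tag}] resident memory: {rss:.0f} MiB")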

Below is my training loop, which consists of three main parts: the semi-supervised part, the training loop, and the validation loop. I am not sure which step causes the RAM to keep climbing every epoch.

(1) Semi-supervised part

import gc
import random

import numpy as np
import torch
from torch.utils.data import ConcatDataset, DataLoader

# model, device, pseudo_loader, train_set, CustomTensorDataset, threshold,
# batch_size, n_epochs and do_semi are all defined earlier in the notebook.

for epoch in range(n_epochs):
    print(f"[ Epoch | {epoch + 1:03d}/{n_epochs:03d} ]")
    if do_semi:
        # Use eval mode while generating pseudo labels so dropout/batch-norm
        # do not perturb the predictions.
        model.eval()
        dataset_0 = []
        dataset_1 = []

        for img in pseudo_loader:
            with torch.no_grad():
                logits = model(img.to(device))
            probs = torch.softmax(logits, dim=-1)

            # Filter the data and construct a new dataset.
            for i in range(len(probs)):
                p = probs[i].tolist()
                idx = p.index(max(p))
                if p[idx] >= threshold:
                    if idx == 0:
                        dataset_0.append(img[i].numpy().reshape(128, 128, 3))
                    else:
                        dataset_1.append(img[i].numpy().reshape(128, 128, 3))
        # Count the pseudo labels per class before balancing/sampling them.
        len_0, len_1 = len(dataset_0), len(dataset_1)
        print('label 0: ', len_0)
        print('label 1: ', len_1)

        # since there may be RAM memory error, restrict to 1000
        if len_0 > 1000:
            dataset_0 = random.sample(dataset_0, 1000)
        if len_1 > 1000:
            dataset_1 = random.sample(dataset_1, 1000)

        # Balance the two classes by downsampling the larger one, then build
        # the pseudo-labelled arrays.
        if len_0 == len_1:
            pseudo_x = np.array(dataset_0 + dataset_1)
            pseudo_y = ['0' for _ in range(len(dataset_0))] + ['1' for _ in range(len(dataset_1))]

        elif len_0 > len_1:
            dataset_0 = random.sample(dataset_0, len(dataset_1))
            pseudo_x = np.array(dataset_0 + dataset_1)
            pseudo_y = ['0' for _ in range(len(dataset_0))] + ['1' for _ in range(len(dataset_1))]

        else:
            dataset_1 = random.sample(dataset_1, len(dataset_0))
            pseudo_x = np.array(dataset_0 + dataset_1)
            pseudo_y = ['0' for _ in range(len(dataset_0))] + ['1' for _ in range(len(dataset_1))]

        if len(pseudo_x) != 0:
            new_dataset = CustomTensorDataset(pseudo_x, np.array(pseudo_y), 'pseudo')
        else:
            new_dataset = []
        # print how many pseudo label data added
        print('Total number of pseudo labeled data are added: ', len(new_dataset))
        # release RAM
        dataset_0 = None
        dataset_1 = None
        pseudo_x  = None
        pseudo_y  = None
        del dataset_0, dataset_1, pseudo_x, pseudo_y
        gc.collect()
        # Turn off the eval mode.
        model.train()

        concat_dataset = ConcatDataset([train_set, new_dataset])
        train_loader = DataLoader(concat_dataset, batch_size=batch_size, shuffle=True)

I am fairly sure the problem occurs in the semi-supervised part, because the RAM usage does not increase when that part is disabled.
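In case it helps narrow things down, here is a sketch of the kind of diagnostic that could pinpoint what grows between epochs, using only the standard-library tracemalloc:

# Sketch: take a snapshot at the end of each epoch and print the top allocation
# differences versus the previous epoch.
import tracemalloc

tracemalloc.start()
prev_snapshot = None

def report_growth():
    global prev_snapshot
    snapshot = tracemalloc.take_snapshot()
    if prev_snapshot is not None:
        for stat in snapshot.compare_to(prev_snapshot, "lineno")[:5]:
            print(stat)
    prev_snapshot = snapshot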

Thanks for your help!!

Tags: deep-learning, pytorch, conv-neural-network, google-colaboratory, semisupervised-learning
