My code does not run on multiple GPUs in PyTorch

Problem description

I am training PyTorch's Faster R-CNN on my dataset. It works fine on one GPU. However, I have access to a system with 4 GPUs and want to use all 4. When I check GPU utilization, only one GPU is being used.

I select the device like this:

if not torch.cuda.is_available() and device_name == 'gpu':
    raise ValueError('GPU is not available!')
elif device_name == 'cpu':
    device = torch.device('cpu')
elif device_name == 'gpu':
    if batch_size % torch.cuda.device_count() != 0:
        raise ValueError('Batch size is not divisible by number of GPUs')
    device = torch.device('cuda')
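The divisibility check above can be isolated into a small helper, which also makes the intended per-GPU batch size explicit. This is a hypothetical helper for illustration (`split_batch_across_gpus` is not part of the question's code or of PyTorch):

```python
def split_batch_across_gpus(batch_size, n_gpus):
    """Enforce the divisibility constraint from the snippet above and
    return the resulting per-GPU batch size."""
    if n_gpus < 1:
        raise ValueError('Need at least one GPU')
    if batch_size % n_gpus != 0:
        raise ValueError('Batch size is not divisible by number of GPUs')
    return batch_size // n_gpus
```

For example, a batch size of 32 on 4 GPUs gives 8 samples per GPU, while a batch size of 30 on 4 GPUs raises the same `ValueError` as the original check.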

After that I do this:

# multi GPUs
if torch.cuda.device_count() > 1 and device_name == 'gpu':
    print('=' * 50)
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    # model = nn.DataParallel(model, device_ids=[i for i in range(torch.cuda.device_count())])
    model = nn.DataParallel(model)
    print('=' * 50)

# transfer model to selected device
model.to(device)
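The `dim = 0` comment above describes how `nn.DataParallel` scatters a batch across replicas: it splits the input along dimension 0 with `tensor.chunk()`, which produces chunks of size `ceil(batch_size / n_gpus)` with a smaller final chunk when the batch does not divide evenly. A minimal pure-Python sketch of that split (the function name `dataparallel_chunk_sizes` is my own, not a PyTorch API):

```python
import math

def dataparallel_chunk_sizes(batch_size, n_gpus):
    """Sketch of how a batch of size `batch_size` is split along dim 0
    across `n_gpus` replicas, mimicking tensor.chunk(): chunks of
    ceil(batch_size / n_gpus), with a smaller remainder chunk last."""
    chunk = math.ceil(batch_size / n_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        step = min(chunk, remaining)
        sizes.append(step)
        remaining -= step
    return sizes
```

So a batch of 30 on 3 GPUs splits into `[10, 10, 10]`, matching the comment, while 32 on 3 GPUs would split into `[11, 11, 10]`.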

I move the data to the device like this:

# iterate over all batches
counter_batches = 0
for images, targets in metric_logger.log_every(data_loader, print_freq, header):

    # transfer tensors to device (GPU, or CPU if not available)
    images = list(image.to(device) for image in images)
    targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

    # in train mode, Faster R-CNN returns losses
    loss_dict = model(images, targets)

    # sum of losses
    losses = sum(loss for loss in loss_dict.values())

I don't know what I am doing wrong.

Also, I get this warning:

/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

Tags: deep-learning pytorch gpu vision

Solution
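The warning itself suggests `nn.DataParallel` is active: each replica returns its losses as scalars, and the gather step stacks them into a vector with one entry per GPU (which is exactly what the `UserWarning` describes). That vector must then be reduced before it is summed; in the training loop this is typically done as `losses = sum(loss.mean() for loss in loss_dict.values())`. The pure-Python sketch below models that reduction, with per-GPU loss values represented as plain lists (the helper name `reduce_gathered_losses` is my own, for illustration only):

```python
def reduce_gathered_losses(loss_dict, reduction="mean"):
    """Reduce gathered per-GPU loss values to a single scalar.

    loss_dict maps loss names to a list of per-replica values, as
    DataParallel's gather would produce. With one GPU each list has
    length 1, so the 'mean' reduction is then a no-op.
    """
    total = 0.0
    for values in loss_dict.values():
        if reduction == "mean":
            total += sum(values) / len(values)
        else:
            total += sum(values)
    return total
```

For example, with two GPUs reporting `{'loss_classifier': [0.6, 0.4], 'loss_box_reg': [0.2, 0.2]}`, the mean-reduced total is `0.5 + 0.2 = 0.7`. This addresses the warning, not necessarily the utilization issue: if only one GPU shows load, it is also worth confirming that `nvidia-smi` is read during the forward/backward pass and that the per-GPU batch is large enough to produce visible work on the other replicas.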

