Problem using multiple GPUs with a two-stage CNN model

Problem description

I designed a CNN model with two stages. The first stage generates proposals, like the RPN in Faster R-CNN, and the second stage feeds those proposals into the following part of the network.

The error occurs in the second stage.

Judging from the error message below, it seems the second set of inputs is not being distributed correctly across the multiple GPUs.

However, the model works fine on a single GPU.

  File "/home/f523/guazai/sdb/rsy/cornerPoject/myCornerNet6/exp/train.py", line 212, in run_epoch
    cls, rgr = self.model([proposal, fm], stage='two')
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in forward
    return self.gather(outputs, self.output_device)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 168, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/cuda/comm.py", line 166, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA error: an illegal memory access was encountered

P.S.

My model script is shown below. I want the two-stage model to support batched input: for example, with a batch size of 4 and 128 proposals per image, the proposal tensor here has shape (4*128, 5).

from torchvision.ops import roi_align

def _stage2(self, xs):
    proposal, fm = xs
    if proposal.dim() == 2 and proposal.size(1) == 5:
        # train mode: boxes come as Tensor[K, 5] with the batch index in column 0
        roi = roi_align(fm, proposal, output_size=[15, 15])
    elif proposal.dim() == 3 and proposal.size(2) == 4:
        # eval mode: boxes come as a list of Tensor[L, 4], one per image
        roi = roi_align(fm, [proposal[0]], output_size=[15, 15])
    else:
        raise AssertionError("The boxes tensor shape should be Tensor[K, 5] in train or Tensor[N, 4] in eval")
    x = self.big_kernel(roi)
    cls = self.cls_fm(x)    # classification head
    rgr = self.rgr_fm(x)    # regression head
    return cls, rgr
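For reference, the Tensor[K, 5] form that roi_align expects in train mode carries the batch index in column 0 and (x1, y1, x2, y2) in the remaining columns. Below is a minimal sketch of assembling such a tensor for a batch of 4 images with 128 proposals each; all names and sizes here are illustrative, not taken from my actual code.

import torch
from torchvision.ops import roi_align

batch_size, num_props = 4, 128
fm = torch.randn(batch_size, 256, 64, 64)              # feature map [N, C, H, W]

# build valid (x1, y1, x2, y2) boxes for each image
xy1 = torch.rand(batch_size, num_props, 2) * 32
wh = torch.rand(batch_size, num_props, 2) * 31 + 1
boxes = torch.cat([xy1, xy1 + wh], dim=2)

# prepend each box's batch index -> proposal has shape (4*128, 5)
idx = torch.arange(batch_size, dtype=torch.float32).view(-1, 1, 1).expand(-1, num_props, 1)
proposal = torch.cat([idx, boxes], dim=2).view(-1, 5)

roi = roi_align(fm, proposal, output_size=[15, 15])     # -> (4*128, 256, 15, 15)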

Tags: python, pytorch, conv-neural-network

Solution


I figured out where I went wrong. This is how I feed the inputs into the second stage:

cls, offset = self.model([proposal, fm], stage='two')

proposal holds the ROIs and has shape [N, 5], where the first column is the batch index. For example, with a batch size of 4 the indices range over [0, 1, 2, 3]. fm is the feature map.

When I use multiple GPUs, say 2, DataParallel splits proposal and fm into two chunks and feeds one to each GPU. However, the batch indices in each chunk still range over [0, 1, 2, 3], which leads to out-of-range indexing into the local feature map and raises the GPU error.
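A toy reproduction of that split (CPU-only, sizes chosen for illustration): DataParallel chunks both tensors along dim 0, so each replica's fm only covers part of the batch while its proposal indices still refer to the full batch.

import torch

fm = torch.randn(4, 8, 16, 16)                    # full batch of 4 feature maps
proposal = torch.tensor([[0., 1, 1, 5, 5],
                         [1., 1, 1, 5, 5],
                         [2., 1, 1, 5, 5],
                         [3., 1, 1, 5, 5]])        # one ROI per image, batch index in column 0

fm_chunks = fm.chunk(2)                            # what each of the 2 GPUs receives
prop_chunks = proposal.chunk(2)

# the second replica's fm has only 2 images (valid indices 0 and 1),
# but its proposals still point at images 2 and 3 of the original batch
print(fm_chunks[1].size(0))                        # 2
print(prop_chunks[1][:, 0])                        # tensor([2., 3.])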

What I did was add one line before roi_align, as follows:

from torchvision.ops import roi_align
proposal[:, 0] = proposal[:, 0] % fm.size(0)  # remap global batch indices into the per-GPU range; this makes multi-GPU work
roi = roi_align(fm, proposal, output_size=[15, 15])
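On each replica the modulo simply folds the global batch indices back into the local range, e.g. on the second of two GPUs, indices 2 and 3 become 0 and 1. This assumes DataParallel splits the batch evenly and the proposals are grouped by image, as they are in my case, so each image's proposals land on the same GPU as its feature map. A tiny standalone check of the remapping:

import torch

idx = torch.tensor([2., 2., 3., 3.])     # global batch indices seen by the second replica
local_batch = 2                          # per-GPU batch size after the split
print(idx % local_batch)                 # tensor([0., 0., 1., 1.])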
