PyTorch - RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress

Problem Description

I have a problem: when I run my model on Google Colab, I frequently get this error:

RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

It doesn't happen every time. Sometimes it runs fine, and other times it doesn't.
(Whether it works or not seems to be related to the batch size; see the edit at the bottom.)

There is no problem when running on the CPU only, so it seems to be related to the GPU/CUDA.

The traceback shows that the error occurs in backward():

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-1-c8bbb05fb58a> in <module>()
     65 out   = model(chars, pos)
     66 loss  = F.binary_cross_entropy_with_logits(out, labels)
---> 67 loss.backward()

1 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    164                 products. Defaults to ``False``.
    165         """
--> 166         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    167 
    168     def register_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: transform: failed to synchronize: cudaErrorIllegalAddress: an illegal memory access was encountered

Once this error has occurred, I can no longer do anything GPU-related in that runtime. Whenever I try, for example, to create a tensor on the GPU, I get a slightly different error message:

RuntimeError: CUDA error: an illegal memory access was encountered

I have to restart the runtime/notebook before I can do anything on the GPU again (I have only tried PyTorch, no other frameworks).
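
For example, even a trivial allocation then fails. A minimal sketch of this follow-up behaviour (assuming the error above has already occurred once in the same runtime; the exact message may vary):

import torch

# any CUDA operation after the first illegal access now fails,
# until the runtime is restarted
x = torch.zeros(1, device='cuda')
# -> RuntimeError: CUDA error: an illegal memory access was encountered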

Here is a code snippet that should reproduce the problem in Google Colab:

import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = '1'

import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(torch.__version__, torch.cuda.get_device_name(0))
from torch import nn
from torch.nn import functional as F
class MyModel(nn.Module):
    def __init__(self, char_vocab, num_pos, dim, hidden, dropout):
        super().__init__()
        self.emb_char  = nn.Embedding(char_vocab, dim)
        self.cnn1 = nn.Conv1d(dim, dim // 2, kernel_size=3)
        self.bn1  = nn.BatchNorm1d(dim // 2)
        self.cnn2 = nn.Conv1d(dim // 2, dim // 4, kernel_size=2)
        self.bn2  = nn.BatchNorm1d(dim // 4)
        self.pooling = nn.MaxPool1d(2)
        self.lin1 = nn.Linear(num_pos, hidden)  # num_pos == 11 position features
        self.bn3  = nn.BatchNorm1d(hidden)
        self.lin2 = nn.Linear(hidden, dim // 4)
        self.bn4  = nn.BatchNorm1d(dim // 4)

        self.out  = nn.Linear(dim // 4, 1)
        self.drop = nn.Dropout(dropout)

    def forward(self, chars, pos):
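        # chars: (batch, 8) integer indices; pos: (batch, num_pos) float features
        # embedding yields (batch, 8, dim); transpose to (batch, dim, 8) for Conv1d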
        x = self.emb_char(chars).transpose(1, 2)
        x = self.drop(x)

        x = self.cnn1(x)
        x = self.bn1(x)
        x = self.pooling(x)
        x = F.relu(x)

        x = self.cnn2(x)
        x = self.bn2(x)
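        # pooling leaves a length of 1; squeeze it away to get (batch, dim // 4)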
        x = self.pooling(x).squeeze(-1)
        x = F.relu(x)
        x = self.drop(x)

        y = F.relu(self.lin1(pos))
        y = self.bn3(y)
        y = self.drop(y)
        y = F.relu(self.lin2(y))
        y = self.bn4(y)
        y = self.drop(y)

        return self.out(x+y)

model = MyModel(char_vocab=80, num_pos=11, dim=32, hidden=8, dropout=0.5).to(device)

batch_size = 160000

# inputs
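# randint(0, 79) draws indices in [0, 78], safely inside char_vocab=80, so
# out-of-range embedding lookups (a classic cause of this error) are ruled out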
chars = torch.randint(0, 79, (batch_size, 8), device=device)
pos   = torch.rand(batch_size, 11, device=device)

# labels
labels = torch.ones(batch_size, 1, dtype=torch.float, device=device)

# forward
out   = model(chars, pos)
loss  = F.binary_cross_entropy_with_logits(out, labels)

# backward
loss.backward()
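
The commented-out CUDA_LAUNCH_BLOCKING line at the top of the snippet is a debugging aid: CUDA kernels launch asynchronously, so the illegal access is often reported at a later synchronization point (here inside backward()) rather than at the kernel that actually caused it. Forcing synchronous launches should move the error to the offending operation; a sketch:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA is first initialized

import torch
# ... re-run the snippet above; the RuntimeError should now surface at the
# failing kernel instead of at loss.backward()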

The PyTorch version is 1.3.1 and the GPU it runs on is a P100-PCIE-16GB.

Any ideas how to get rid of this error?


Edit:

Tags: python, deep-learning, gpu, pytorch

Solution

