Runtime error during training with CUDA: Edge-Conditioned Convolution on Graphs

Problem description

I am fairly new to Python and am currently trying to use CUDA with a specific neural network, Edge-Conditioned Convolution on Graphs. The code can be found here: https://github.com/mys007/ecc

I know there are several questions similar to mine, but none of them helped me solve my problem.

I want to train on a dataset with CUDA, but the process stops during training, at a seemingly random epoch, with the following error:

File "./main.py", line 317, in <module>
main()
File "./main.py", line 219, in main
acc_train, loss, t_loader, t_trainer = train(epoch)
File "./main.py", line 150, in train
outputs = model(inputs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/ECC_Test/models.py", line 105, in forward
input = module(input)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
result = self.forward(*input, **kwargs)
File "/workspace/ECC_Test/ecc/GraphConvModule.py", line 173, in forward
return GraphConvFunction(self._in_channels, self._out_channels, idxn, idxe, degs, degs_gpu, self._edge_mem_limit)(input, weights)
File "/workspace/ECC_Test/ecc/GraphConvModule.py", line 69, in forward
cuda_kernels.conv_aggregate_fw(output.narrow(0,startd,numd), products.view(-1,self._out_channels), self._degs_gpu.narrow(0,startd,numd))
File "/workspace/ECC_Test/ecc/cuda_kernels.py", line 122, in conv_aggregate_fw
block=(CUDA_NUM_THREADS,1,1), grid=(GET_BLOCKS(w),n//blockDimY+1,1), stream=stream)            
File "cupy/cuda/function.pyx", line 148, in cupy.cuda.function.Function.__call__
File "cupy/cuda/function.pyx", line 130, in cupy.cuda.function._launch
File "cupy/cuda/driver.pyx", line 228, in cupy.cuda.driver.launchKernel
File "cupy/cuda/driver.pyx", line 81, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Traceback (most recent call last):
File "cupy/cuda/driver.pyx", line 192, in cupy.cuda.driver.moduleUnload
File "cupy/cuda/driver.pyx", line 81, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Exception ignored in: 'cupy.cuda.function.Module.__dealloc__'
Traceback (most recent call last):
File "cupy/cuda/driver.pyx", line 192, in cupy.cuda.driver.moduleUnload
File "cupy/cuda/driver.pyx", line 81, in cupy.cuda.driver.check_status
cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

The traceback above was produced with CUDA_LAUNCH_BLOCKING=1.
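For anyone trying to reproduce this, here is a minimal sketch of how synchronous kernel launches can be requested from inside Python. It assumes the variable is set before torch or cupy is imported, so that it is in place before CUDA is initialized. The helper `launch_blocking_enabled` is a hypothetical name used only for illustration and is not part of the ecc repository.

```python
import os

# Set this before importing torch or cupy, so it is already in the
# environment when CUDA is initialized. With it, kernel launches run
# synchronously and the Python traceback points at the failing launch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def launch_blocking_enabled() -> bool:
    """Return True if synchronous kernel launches were requested (hypothetical helper)."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", launch_blocking_enabled())
```

The same effect can be had from the shell by prefixing the command, for example `CUDA_LAUNCH_BLOCKING=1 python ./main.py`.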

Switching to the CPU and disabling CUDA works fine.

I access the server over SSH. It has 4 Nvidia Tesla V100 32GB GPUs with driver version 410.104. CUDA 10.1 and Python 3.6.8 are installed.

PyTorch is currently at version 1.1. Could a newer PyTorch version cause problems in combination with CUDA 10.1? Or am I running out of memory on the GPU?
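To help narrow down the version and memory questions, the following is a rough sketch of how the installed versions and per-GPU memory figures could be collected. It is only an illustration: `gpu_report` is a hypothetical helper, and it returns None when torch or a CUDA device is not available. Note that `torch.cuda.memory_allocated` only counts memory held by PyTorch tensors, so memory used by CuPy would not show up in it.

```python
def gpu_report():
    """Return version info and per-GPU memory figures, or None without CUDA.

    Hypothetical helper for illustration; not part of the ecc repository.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    report = {
        "torch": torch.__version__,   # installed PyTorch version
        "cuda": torch.version.cuda,   # CUDA version PyTorch was built against
        "devices": [],
    }
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        report["devices"].append({
            "name": props.name,
            "total_bytes": props.total_memory,
            # Only memory occupied by PyTorch tensors; CuPy is not included.
            "allocated_bytes": torch.cuda.memory_allocated(i),
        })
    return report


if __name__ == "__main__":
    print(gpu_report())
```

Calling this once per epoch, just before the forward pass, would show whether the allocated figure approaches the 32GB total before the crash.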

Tags: python, python-3.x, conv-neural-network, pytorch

Solution

