RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at mmdet/ops/roi_align/src/roi_align_kernel.cu:139

Problem description

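I'm having some trouble running my code on a Google Compute Engine VM.

I'm trying to run a small Flask API that detects tables in images. Initializing the detector model works, but when I try to detect tables, this error occurs: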

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.5/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.5/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "ElvyCascadeTabNetAPI.py", line 36, in detect_tables
    result = inference_detector(model, "temp.jpg")
  File "/SingleModelTest/src/mmdet/mmdet/apis/inference.py", line 86, in inference_detector
    result = model(return_loss=False, rescale=True, **data)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/core/fp16/decorators.py", line 49, in new_func
    return old_func(*args, **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/models/detectors/base.py", line 149, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/models/detectors/base.py", line 130, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/models/detectors/cascade_rcnn.py", line 342, in simple_test
    x[:len(bbox_roi_extractor.featmap_strides)], rois)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/core/fp16/decorators.py", line 127, in new_func
    return old_func(*args, **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/models/roi_extractors/single_level.py", line 105, in forward
    roi_feats_t = self.roi_layers[i](feats[i], rois_)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/SingleModelTest/src/mmdet/mmdet/ops/roi_align/roi_align.py", line 144, in forward
    self.sample_num, self.aligned)
  File "/SingleModelTest/src/mmdet/mmdet/ops/roi_align/roi_align.py", line 36, in forward
    spatial_scale, sample_num, output)
RuntimeError: cuda runtime error (48) : no kernel image is available for execution on the device at mmdet/ops/roi_align/src/roi_align_kernel.cu:139

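While searching for possible solutions, I came across several Stack Overflow questions where the cause was an older GPU that was no longer supported, so I switched the GPU on my Google Compute Engine VM from an Nvidia Tesla K80 to a newer Nvidia Tesla T4. The K80 has a CUDA compute capability of 3.7 and the T4 has 7.5, so I expected this to fix the problem, but it did not.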

Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   72C    P8    12W /  70W |    106MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       918      G   /usr/lib/xorg/Xorg                 95MiB |
|    0   N/A  N/A       974      G   /usr/bin/gnome-shell                9MiB |
+-----------------------------------------------------------------------------+

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Torch versions: torch 1.4.0+cu100, torchvision 0.5.0+cu100
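
The `+cu100` suffix on both wheels indicates the CUDA runtime they were built against. A minimal sketch of reading that tag, assuming the usual `+cuXYZ` naming where the last digit is the minor version; `cuda_from_wheel_tag` is a hypothetical helper written for illustration, not part of PyTorch:

```python
import re


def cuda_from_wheel_tag(version):
    """Return (major, minor) of the CUDA runtime encoded in a wheel version.

    Assumption: the local version suffix looks like `+cuXYZ`, with the last
    digit being the minor version (cu100 -> 10.0, cu92 -> 9.2).
    Returns None when no such suffix is present (e.g. a CPU-only build).
    """
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    return int(digits[:-1]), int(digits[-1])


# Version strings reported above.
print("torch:", cuda_from_wheel_tag("1.4.0+cu100"))
print("torchvision:", cuda_from_wheel_tag("0.5.0+cu100"))
```

Both calls yield CUDA 10.0 for the versions listed here, while the nvcc output above reports release 10.1.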

I run the API in a Docker container. My Dockerfile:

# Dockerfile
FROM nvidia/cuda:10.0-devel

RUN nvidia-smi

RUN set -xe \
    && apt-get update \
    && apt-get install python3-pip -y \
    && apt-get install git -y \
    && apt-get install libgl1-mesa-glx -y
RUN pip3 install --upgrade pip

WORKDIR /SingleModelTest

COPY requirements /SingleModelTest/requirements

RUN export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64

RUN pip3 install -r requirements/requirements1.txt
RUN pip3 install -r requirements/requirements2.txt


COPY . /SingleModelTest

ENTRYPOINT ["python3"]

CMD ["TabNetAPI.py"]

Edit: I was confused that nvidia-smi reports a higher CUDA version than the one I installed, but this turns out to be normal: https://medium.com/@brianhourigan/if-different-cuda-versions-are-shown-by-nvcc-and-nvidia-smi-its-necessarily-not-a-problem-and-311eda26856c
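
The two numbers measure different things: nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc reports the version of the installed toolkit. A small sketch that extracts both from the outputs quoted above and checks that the driver version is at least the toolkit version; `parse_version` and the regular expressions are illustrative assumptions about the output format, not an official interface:

```python
import re


def parse_version(text, pattern):
    """Extract a dotted version matched by `pattern` as a tuple of ints."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError("no version found in: " + text)
    return tuple(int(part) for part in match.group(1).split("."))


# Lines copied from the nvidia-smi and nvcc outputs in this question.
SMI_HEADER = "| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |"
NVCC_LINE = "Cuda compilation tools, release 10.1, V10.1.243"

driver_cuda = parse_version(SMI_HEADER, r"CUDA Version:\s*(\d+\.\d+)")
toolkit_cuda = parse_version(NVCC_LINE, r"release\s+(\d+\.\d+)")

# The driver must support at least the toolkit/runtime version in use.
print(driver_cuda, toolkit_cuda, driver_cuda >= toolkit_cuda)
# prints: (11, 0) (10, 1) True
```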

If anyone has a solution, I'd really appreciate it. If I need to provide more information, I'm happy to do so.

Thanks in advance.

Tags: python, ubuntu, pytorch, google-compute-engine, nvidia

Solution
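
The error text means the compiled extension (here the mmdet `roi_align` op) contains no code the current GPU can run: neither machine code (SASS) for a compatible architecture nor PTX that the driver could JIT-compile. A minimal sketch of that compatibility rule follows. The function `has_kernel_image` and the architecture lists in the example are hypothetical, since the question does not show which architectures the ops were actually built for:

```python
def has_kernel_image(device_cc, sass_archs, ptx_archs):
    """Return True if a binary could run on a GPU of capability `device_cc`.

    device_cc:  (major, minor) of the GPU, e.g. (7, 5) for a Tesla T4.
    sass_archs: capabilities with precompiled machine code (sm_XY) embedded.
    ptx_archs:  capabilities with PTX (compute_XY) embedded for JIT use.
    """
    major, minor = device_cc
    # Machine code is reusable only within the same major version,
    # on devices whose minor version is equal or higher.
    sass_ok = any(m == major and n <= minor for m, n in sass_archs)
    # PTX can be JIT-compiled for any device of equal or newer capability.
    ptx_ok = any((m, n) <= (major, minor) for m, n in ptx_archs)
    return sass_ok or ptx_ok


# Hypothetical build: machine code for 3.7 only, no PTX embedded.
print("K80:", has_kernel_image((3, 7), [(3, 7)], []))
print("T4: ", has_kernel_image((7, 5), [(3, 7)], []))
# Hypothetical rebuild that also targets 7.5.
print("T4 after rebuild:", has_kernel_image((7, 5), [(3, 7), (7, 5)], []))
```

On the real system, `torch.cuda.get_device_capability(0)` returns the device tuple to compare against. If the ops turn out to have been compiled without support for 7.5, rebuilding them for the T4 (for example with the `TORCH_CUDA_ARCH_LIST` environment variable set during the build) is one thing to try; whether that is the cause here is an assumption.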

