GKE - Unable to make cuda work with pytorch

Problem description

I have set up a Kubernetes node with an NVIDIA Tesla K80 and followed this tutorial to try to run a PyTorch Docker image with working NVIDIA and CUDA drivers.

My NVIDIA drivers and CUDA drivers are both accessible inside my pod under /usr/local:

$> ls /usr/local
bin  cuda  cuda-10.0  etc  games  include  lib  man  nvidia  sbin  share  src

My GPU is also recognized by my image, nvidia/cuda:10.0-runtime-ubuntu18.04:

$> /usr/local/nvidia/bin/nvidia-smi
Fri Nov  8 16:24:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   73C    P8    35W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

However, after installing PyTorch 1.3.0, I cannot get PyTorch to recognize my CUDA installation, even with LD_LIBRARY_PATH set to /usr/local/nvidia/lib64:/usr/local/cuda/lib64:

$> python3 -c "import torch; print(torch.cuda.is_available())"
False

$> python3
Python 3.6.8 (default, Oct  7 2019, 12:59:55)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print ('\t\ttorch.cuda.current_device()    =', torch.cuda.current_device())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 386, in current_device
    _lazy_init()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 192, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 111, in _check_driver
    of the CUDA driver.""".format(str(torch._C._cuda_getDriverVersion())))
AssertionError:
The NVIDIA driver on your system is too old (found version 10000).
Please update your GPU driver by downloading and installing a new
version from the URL: http://www.nvidia.com/Download/index.aspx
Alternatively, go to: https://pytorch.org to install
a PyTorch version that has been compiled with your version
of the CUDA driver.
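
A note on the number in this message: 10000 is not the driver release (410.79). It appears to be the CUDA version the driver reports through the CUDA driver API, which encodes versions as 1000 * major + 10 * minor, so 10000 would correspond to CUDA 10.0. The sketch below only illustrates that decoding; the helper name decode_cuda_version is hypothetical and is not part of PyTorch.

```python
def decode_cuda_version(encoded):
    """Split a CUDA-style integer version into (major, minor).

    Assumes the 1000 * major + 10 * minor encoding used by the CUDA
    driver API; this helper is illustrative, not a PyTorch function.
    """
    major, rest = divmod(encoded, 1000)
    minor = rest // 10
    return major, minor


if __name__ == "__main__":
    # 10000 is the value quoted in the AssertionError above.
    major, minor = decode_cuda_version(10000)
    print("driver supports CUDA {}.{}".format(major, minor))
```

Read this way, the message is saying that the driver supports at most CUDA 10.0, which agrees with the nvidia-smi header shown earlier.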

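The error above is strange, because the CUDA version of my image is 10.0 and the Google GKE documentation states:

The latest supported CUDA version is 10.0

Also, the NVIDIA drivers are installed automatically by GKE's DaemonSet: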

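After adding GPU nodes to your cluster, you need to install NVIDIA's device drivers on the nodes.

Google provides a DaemonSet that automatically installs the drivers for you. Refer to the section below for installation instructions for Container-Optimized OS (COS) and Ubuntu nodes.

To deploy the installation DaemonSet, run the following command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml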

I have tried everything I could think of, without success...

Tags: kubernetes, google-cloud-platform, pytorch, google-kubernetes-engine

Solution


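I resolved my problem by downgrading PyTorch: I now build my Docker image from pytorch/pytorch:1.2-cuda10.0-cudnn7-devel.

I still don't know why it was not working before, other than guessing that pytorch 1.3.0 is not compatible with cuda 10.0.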
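
One way to test that guess is to compare the CUDA version the installed PyTorch wheel was built against (torch.version.cuda) with the CUDA version shown in the nvidia-smi header. The sketch below assumes the driver-side value is typed in by hand ("10.0" here, taken from the nvidia-smi output in the question); the helper names parse_version and build_exceeds_driver are hypothetical. Whether the 1.3.0 wheel in the question actually targeted a newer CUDA release has not been verified here.

```python
def parse_version(text):
    """Turn a version string such as '10.1' or '10.1.243' into (10, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def build_exceeds_driver(build_cuda, driver_cuda):
    """Return True if the wheel targets a newer CUDA than the driver supports."""
    return parse_version(build_cuda) > parse_version(driver_cuda)


if __name__ == "__main__":
    # Assumption: copied by hand from the "CUDA Version" field of nvidia-smi.
    driver_cuda = "10.0"
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        build_cuda = torch.version.cuda  # None for CPU-only builds
        print("torch", torch.__version__, "built for CUDA", build_cuda)
        if build_cuda is None:
            print("this is a CPU-only build")
        elif build_exceeds_driver(build_cuda, driver_cuda):
            print("the wheel targets a newer CUDA than the driver supports")
        else:
            print("the build and driver CUDA versions look compatible")
```

If the wheel reports a CUDA version newer than 10.0, that would be consistent with the "driver is too old" assertion, and with the fix of switching to an image built for CUDA 10.0.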

