CUDA_ERROR_NOT_INITIALIZED on A100 after server reset

Problem description

I am running on a server with an A100 GPU. When I try to run TensorFlow code after a server reset, TensorFlow does not recognize the GPU. Running tf.config.list_physical_devices('GPU') yields CUDA_ERROR_NOT_INITIALIZED:

2021-09-09 07:41:42.956917: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-09-09 07:41:43.899014: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error
2021-09-09 07:41:43.899148: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: f42a3aa12bd1
2021-09-09 07:41:43.899169: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: f42a3aa12bd1
2021-09-09 07:41:43.899890: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 460.32.3
2021-09-09 07:41:43.899955: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 460.32.3
2021-09-09 07:41:43.899969: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 460.32.3

Running nvidia-smi gives:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      Off  | 00000000:00:06.0 Off |                   On |
| N/A   46C    P0    40W / 250W |      0MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Why am I getting CUDA_ERROR_NOT_INITIALIZED? The server worked fine before the reset, and nvidia-smi clearly works.

Tags: tensorflow, cuda, gpu, nvidia

Solution


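It looks like NVIDIA Multi-Instance GPU (MIG) is enabled on your GPU, but you have not defined any GPU instances. You can tell because nvidia-smi shows a MIG devices table, but the table is empty (No MIG devices found).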
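The same check can be automated by scanning the text that nvidia-smi prints. The following is only a sketch: the function names diagnose_mig and read_nvidia_smi and the three return labels are hypothetical, and it assumes the MIG table keeps the "MIG devices:" and "No MIG devices found" wording shown in the question's output (driver 460.32.03).

```python
import subprocess


def diagnose_mig(smi_output: str) -> str:
    """Classify plain-text `nvidia-smi` output by its MIG devices table."""
    if "MIG devices:" not in smi_output:
        # No MIG table was printed at all.
        return "no-mig-table"
    if "No MIG devices found" in smi_output:
        # The MIG table is present but empty: no GPU instance exists yet.
        return "mig-enabled-no-instances"
    return "mig-instances-present"


def read_nvidia_smi() -> str:
    """Return the text printed by `nvidia-smi` (requires the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the affected server, diagnose_mig(read_nvidia_smi()) would be expected to return "mig-enabled-no-instances" for the output shown in the question.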

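The MIG documentation states:

Without creating GPU instances (and corresponding compute instances), CUDA workloads cannot be run on the GPU. In other words, simply enabling MIG mode on the GPU is not sufficient. Also note that the created MIG devices are not persistent across system reboots. Thus, the user or system administrator needs to recreate the desired MIG configurations if the GPU or system is reset.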

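You probably had a MIG configuration defined before the reset, and the server reset removed it. You need to reconfigure the GPU instances for the GPU to work again. If you only want a basic configuration with a single GPU instance that uses all of the resources, you can run: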

sudo nvidia-smi mig -cgi 0 -C

If you need a more advanced configuration than this, consult the MIG documentation.
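Because MIG devices do not survive a reset, it can be convenient to re-run the command above from a startup script whenever the MIG table is empty. The following is a minimal sketch under stated assumptions: the helper names build_mig_create_command and recreate_if_missing are hypothetical, profile 0 is taken from the command above, and the empty-table check relies on the "No MIG devices found" text shown in the question.

```python
import subprocess
from typing import List


def build_mig_create_command(profile_id: int = 0, use_sudo: bool = True) -> List[str]:
    """Build the `nvidia-smi mig -cgi <profile> -C` command as an argument list."""
    cmd = ["nvidia-smi", "mig", "-cgi", str(profile_id), "-C"]
    return ["sudo", *cmd] if use_sudo else cmd


def recreate_if_missing(smi_output: str, run=subprocess.run) -> bool:
    """Run the create command only when the MIG devices table is empty.

    `smi_output` is the text printed by plain `nvidia-smi`.
    Returns True if the create command was issued, False otherwise.
    """
    if "No MIG devices found" not in smi_output:
        return False
    run(build_mig_create_command(), check=True)
    return True
```

The run parameter exists only so that the command can be replaced with a stub when trying the logic on a machine without a GPU.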

Once the GPU instance is configured, nvidia-smi should show a populated MIG devices table. In our case it should contain a single entry:

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    0   0   0  |      0MiB / 40536MiB | 98      0 |  7   0    5    1    1 |
|                  |      1MiB / 65536MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
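After the MIG device appears, the call from the question can be repeated to confirm that TensorFlow now sees the GPU. This is a small sketch; the helper name visible_gpu_count is hypothetical, and it is best run in a fresh Python process, on the assumption that a process which already failed cuInit will not retry it.

```python
from typing import Optional


def visible_gpu_count() -> Optional[int]:
    """Return how many GPUs TensorFlow sees, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return len(tf.config.list_physical_devices("GPU"))


count = visible_gpu_count()
if count is None:
    print("TensorFlow is not installed in this environment")
else:
    print(f"TensorFlow sees {count} GPU device(s)")
```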
