GPU not working with Debian 10 using tensorflow 2.5.0 and nvidia 11.1 / 455

Problem description

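I'm having trouble getting my GPUs to work in Python. I'm currently trying tf-nightly-gpu 2.5.0. I've been trying to get this working for over a week and have done a lot of Googling. I've tried several different versions of tensorflow in different virtualenv setups, but so far, nada. Thanks for any suggestions! I can usually figure this kind of thing out, but I'm really stuck on this one.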

These are P102-100 cards. Interestingly, they seem to work with hashcat:

(phdproj) root@gpu:~/phd-cnn# hashcat -b -m 1000
hashcat (v6.1.1) starting in benchmark mode...

CUDA API (CUDA 11.1)

* Device #1: P102-100, 4803/5059 MB, 25MCU
* Device #2: P102-100, 4803/5059 MB, 25MCU
* Device #3: P102-100, 4803/5059 MB, 25MCU
* Device #4: P102-100, 4803/5059 MB, 25MCU
* Device #5: P102-100, 4803/5059 MB, 25MCU

OpenCL API (OpenCL 1.2 CUDA 11.1.114) - Platform #1 [NVIDIA Corporation]

* Device #6: P102-100, skipped
* Device #7: P102-100, skipped
* Device #8: P102-100, skipped
* Device #9: P102-100, skipped
* Device #10: P102-100, skipped

OpenCL API (OpenCL 1.2 pocl 1.5, None+Asserts, LLVM 9.0.1, RELOC, SLEEF, DISTRO, POCL_DEBUG) - Platform #2 [The pocl project]

* Device #11: pthread-Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, skipped

Benchmark relevant options:

* --optimized-kernel-enable

Hashmode: 1000 - NTLM

Speed.#1.........: 48910.2 MH/s (34.00ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#2.........: 48186.9 MH/s (34.51ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#3.........: 48377.3 MH/s (34.36ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#4.........: 47245.2 MH/s (35.30ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#5.........: 48653.3 MH/s (34.23ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#*.........:   241.4 GH/s

Started: Sun Dec 27 11:15:45 2020
Stopped: Sun Dec 27 11:16:06 2020

Here is some information:

Sun Dec 27 10:58:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  P102-100            Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   17C    P8     6W / 250W |    131MiB /  5059MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  P102-100            Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   17C    P8     6W / 250W |    131MiB /  5059MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  P102-100            Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   17C    P8     5W / 250W |    131MiB /  5059MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  P102-100            Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   16C    P8     5W / 250W |    131MiB /  5059MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  P102-100            Off  | 00000000:06:00.0 Off |                  N/A |
|  0%   12C    P8     6W / 250W |    131MiB /  5059MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       771      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A    752630      C   ...hd-cnn/phdproj/bin/python      125MiB |
|    1   N/A  N/A       771      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    752630      C   ...hd-cnn/phdproj/bin/python      125MiB |
|    2   N/A  N/A       771      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    752630      C   ...hd-cnn/phdproj/bin/python      125MiB |
|    3   N/A  N/A       771      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    752630      C   ...hd-cnn/phdproj/bin/python      125MiB |
|    4   N/A  N/A       771      G   /usr/lib/xorg/Xorg                  4MiB |
|    4   N/A  N/A    752630      C   ...hd-cnn/phdproj/bin/python      125MiB |
+-----------------------------------------------------------------------------+

#nvidia-detect
Detected NVIDIA GPUs:
02:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
03:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
04:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
05:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
06:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)

Checking card:  NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

Checking card:  NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

Checking card:  NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

Checking card:  NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

Checking card:  NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.

import tensorflow as tf
import numpy
print(tf.__version__)
print(numpy.version.version)
print(tf.config.experimental.list_physical_devices('GPU'))

2.5.0-dev20201216
1.19.4
0

This is what the console shows:

2020-12-27 18:43:14.381247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.382375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.382482: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.383534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 1 with properties:
pciBusID: 0000:03:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.383593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.384576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 2 with properties:
pciBusID: 0000:04:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.384632: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.385610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 3 with properties:
pciBusID: 0000:05:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.385664: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.386652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 4 with properties:
pciBusID: 0000:06:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.386677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-12-27 18:43:14.386691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-12-27 18:43:14.386706: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-12-27 18:43:14.386718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-12-27 18:43:14.386727: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-12-27 18:43:14.386795: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2020-12-27 18:43:14.386810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-12-27 18:43:14.386858: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2020-12-27 18:43:14.386871: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-12-27 18:43:14.386908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-27 18:43:14.386919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306]      0 1 2 3 4
2020-12-27 18:43:14.386926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 0:   N N N N N
2020-12-27 18:43:14.386930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 1:   N N N N N
2020-12-27 18:43:14.386934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 2:   N N N N N
2020-12-27 18:43:14.386937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 3:   N N N N N
2020-12-27 18:43:14.386941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 4:   N N N N N
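
In a log this long, the two `W` lines from `dso_loader.cc` are easy to miss, yet they name exactly which libraries failed to load. As a minimal sketch (the function name `missing_libraries` and the regular expression are illustrative assumptions, not part of TensorFlow), the missing library names can be pulled out of a captured log like this:

```python
import re

# Matches TensorFlow dso_loader warnings of the form:
#   Could not load dynamic library 'libfoo.so.N'; dlerror: ...
# The pattern is an assumption based on the log lines shown above.
_MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names that failed to load, in order, without duplicates."""
    seen = []
    for name in _MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Three lines copied from the console output above.
sample_log = """\
2020-12-27 18:43:14.386795: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2020-12-27 18:43:14.386810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-12-27 18:43:14.386858: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
"""

print(missing_libraries(sample_log))
# prints ['libcusolver.so.10', 'libcudnn.so.8']
```

For this log the result is `libcusolver.so.10` and `libcudnn.so.8`, which matches the "Cannot dlopen some GPU libraries" warning that follows them.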

Then I tried this:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

And got this:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16091595871081806106
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4523556864
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12552843679060776178
physical_device_desc: "device: 0, name: P102-100, pci bus id: 0000:02:00.0, compute capability: 6.1"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 4523556864
locality {
  bus_id: 1
  links {
  }
}
incarnation: 10038670458830395131
physical_device_desc: "device: 1, name: P102-100, pci bus id: 0000:03:00.0, compute capability: 6.1"
, name: "/device:GPU:2"
device_type: "GPU"
memory_limit: 4523556864
locality {
  bus_id: 1
  links {
  }
}
incarnation: 5918382531485927936
physical_device_desc: "device: 2, name: P102-100, pci bus id: 0000:04:00.0, compute capability: 6.1"
, name: "/device:GPU:3"
device_type: "GPU"
memory_limit: 4523556864
locality {
  bus_id: 1
  links {
  }
}
incarnation: 12194179101487290626
physical_device_desc: "device: 3, name: P102-100, pci bus id: 0000:05:00.0, compute capability: 6.1"
, name: "/device:GPU:4"
device_type: "GPU"
memory_limit: 4523556864
locality {
  bus_id: 1
  links {
  }
}
incarnation: 4221936322635506021
physical_device_desc: "device: 4, name: P102-100, pci bus id: 0000:06:00.0, compute capability: 6.1"
]

Tags: tensorflow, debian, nvidia

Solution


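So, because of the missing libraries, I installed a tensorflow container. I couldn't get it to work properly, so I ended up copying the required libraries from the container into my /usr/lib/x86_64-linux-gnu directory, which seems to have solved my problem. I used this command: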

cp /var/lib/docker/overlay2/7327c1946fb4e673086c1c74f310e4ec3a767101d37640542737a75b7594a847/diff/usr/lib/x86_64-linux-gnu/libcudnn* /usr/lib/x86_64-linux-gnu/
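
To confirm that the dynamic loader can now find the libraries, without starting TensorFlow, something like the following may help. It is a minimal sketch: the library names are taken from the log in the question, `check_libraries` is a hypothetical helper, and `ctypes.CDLL` is used only as a stand-in for the `dlopen` call TensorFlow makes. Note that the log also lists `libcusolver.so.10` as missing, while the command above copies only `libcudnn*`, so the sketch checks both.

```python
import ctypes

# Library names as they appear in the TensorFlow log from the question.
# Other TensorFlow builds may look for different version suffixes.
REQUIRED_LIBS = [
    "libcudart.so.11.0",
    "libcublas.so.11",
    "libcublasLt.so.11",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.11",
    "libcudnn.so.8",
]


def check_libraries(names):
    """Try to load each shared library; map its name to True if loading succeeded."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the normal dynamic loader search path
            results[name] = True
        except OSError:
            results[name] = False
    return results


for lib, ok in check_libraries(REQUIRED_LIBS).items():
    print(f"{lib}: {'found' if ok else 'MISSING'}")
```

Any library reported as MISSING here would probably still trigger the "Cannot dlopen some GPU libraries" warning. If a file was copied but is still not found, running `ldconfig` as root to refresh the loader cache may be needed.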
