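tensorflow - GPU not working with tensorflow 2.5.0 and nvidia 11.1 / 455 on Debian 10
Problem description
I'm having trouble getting my GPUs to work in Python. I'm currently trying the tf-nightly-gpu 2.5.0 build of TensorFlow. I've been trying to get this working for over a week and have done a lot of googling. I've tried several different TensorFlow versions in different virtualenv setups, but so far, nada. Thanks for any suggestions! I can usually figure this kind of thing out, but I'm really stuck on this one.
These are P102-100 cards. Interestingly, they seem to work with hashcat: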
(phdproj) root@gpu:~/phd-cnn# hashcat -b -m 1000
hashcat (v6.1.1) starting in benchmark mode...
CUDA API (CUDA 11.1)
* Device #1: P102-100, 4803/5059 MB, 25MCU
* Device #2: P102-100, 4803/5059 MB, 25MCU
* Device #3: P102-100, 4803/5059 MB, 25MCU
* Device #4: P102-100, 4803/5059 MB, 25MCU
* Device #5: P102-100, 4803/5059 MB, 25MCU
OpenCL API (OpenCL 1.2 CUDA 11.1.114) - Platform #1 [NVIDIA Corporation]
* Device #6: P102-100, skipped
* Device #7: P102-100, skipped
* Device #8: P102-100, skipped
* Device #9: P102-100, skipped
* Device #10: P102-100, skipped
OpenCL API (OpenCL 1.2 pocl 1.5, None+Asserts, LLVM 9.0.1, RELOC, SLEEF, DISTRO, POCL_DEBUG) - Platform #2 [The pocl project]
* Device #11: pthread-Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz, skipped
Benchmark relevant options:
* --optimized-kernel-enable
Hashmode: 1000 - NTLM
Speed.#1.........: 48910.2 MH/s (34.00ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#2.........: 48186.9 MH/s (34.51ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#3.........: 48377.3 MH/s (34.36ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#4.........: 47245.2 MH/s (35.30ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#5.........: 48653.3 MH/s (34.23ms) @ Accel:64 Loops:1024 Thr:1024 Vec:1
Speed.#*.........: 241.4 GH/s
Started: Sun Dec 27 11:15:45 2020
Stopped: Sun Dec 27 11:16:06 2020
Here is some information:
Sun Dec 27 10:58:05 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01 Driver Version: 455.45.01 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 P102-100 Off | 00000000:02:00.0 Off | N/A |
| 0% 17C P8 6W / 250W | 131MiB / 5059MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 P102-100 Off | 00000000:03:00.0 Off | N/A |
| 0% 17C P8 6W / 250W | 131MiB / 5059MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 P102-100 Off | 00000000:04:00.0 Off | N/A |
| 0% 17C P8 5W / 250W | 131MiB / 5059MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 P102-100 Off | 00000000:05:00.0 Off | N/A |
| 0% 16C P8 5W / 250W | 131MiB / 5059MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 P102-100 Off | 00000000:06:00.0 Off | N/A |
| 0% 12C P8 6W / 250W | 131MiB / 5059MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 771 G /usr/lib/xorg/Xorg 4MiB |
| 0 N/A N/A 752630 C ...hd-cnn/phdproj/bin/python 125MiB |
| 1 N/A N/A 771 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 752630 C ...hd-cnn/phdproj/bin/python 125MiB |
| 2 N/A N/A 771 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 752630 C ...hd-cnn/phdproj/bin/python 125MiB |
| 3 N/A N/A 771 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 752630 C ...hd-cnn/phdproj/bin/python 125MiB |
| 4 N/A N/A 771 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 752630 C ...hd-cnn/phdproj/bin/python 125MiB |
+-----------------------------------------------------------------------------+
#nvidia-detect
Detected NVIDIA GPUs:
02:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
03:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
04:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
05:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
06:00.0 3D controller [0302]: NVIDIA Corporation GP102 [P102-100] [10de:1b07] (rev a1)
Checking card: NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
Checking card: NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
Checking card: NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
Checking card: NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
Checking card: NVIDIA Corporation GP102 [P102-100] (rev a1)
Uh oh. Your card is not supported by any driver version up to 455.45.01.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
import tensorflow as tf
import numpy
print(tf.__version__)
print(numpy.version.version)
print(len(tf.config.experimental.list_physical_devices('GPU')))  # number of GPUs TensorFlow can see
2.5.0-dev20201216
1.19.4
0
This is what the console shows:
2020-12-27 18:43:14.381247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.382375: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 0 with properties:
pciBusID: 0000:02:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.382482: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.383534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 1 with properties:
pciBusID: 0000:03:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.383593: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.384576: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 2 with properties:
pciBusID: 0000:04:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.384632: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.385610: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 3 with properties:
pciBusID: 0000:05:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.385664: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-12-27 18:43:14.386652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1760] Found device 4 with properties:
pciBusID: 0000:06:00.0 name: P102-100 computeCapability: 6.1
coreClock: 1.683GHz coreCount: 25 deviceMemorySize: 4.94GiB deviceMemoryBandwidth: 410.15GiB/s
2020-12-27 18:43:14.386677: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2020-12-27 18:43:14.386691: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2020-12-27 18:43:14.386706: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2020-12-27 18:43:14.386718: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2020-12-27 18:43:14.386727: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2020-12-27 18:43:14.386795: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2020-12-27 18:43:14.386810: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2020-12-27 18:43:14.386858: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory
2020-12-27 18:43:14.386871: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1797] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-12-27 18:43:14.386908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1300] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-27 18:43:14.386919: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1306] 0 1 2 3 4
2020-12-27 18:43:14.386926: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 0: N N N N N
2020-12-27 18:43:14.386930: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 1: N N N N N
2020-12-27 18:43:14.386934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 2: N N N N N
2020-12-27 18:43:14.386937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 3: N N N N N
2020-12-27 18:43:14.386941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1319] 4: N N N N N
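The two warnings in the log above are the lines that matter: they name the shared libraries TensorFlow could not open before it skipped GPU registration. As a rough sketch, assuming the console output has been captured as text, the names can be pulled out with a regular expression. The function name `missing_libraries` and the `sample` string are made up for this illustration and are not part of TensorFlow.

```python
import re

# Matches TensorFlow's dso_loader warning and captures the quoted library name.
MISSING_LIB = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as not loadable, in order, without duplicates."""
    seen = []
    for name in MISSING_LIB.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Shortened copies of the two warning lines from the log above.
sample = (
    "W dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; "
    "dlerror: libcusolver.so.10: cannot open shared object file\n"
    "W dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; "
    "dlerror: libcudnn.so.8: cannot open shared object file\n"
)
print(missing_libraries(sample))
```

For this log the result is libcusolver.so.10 and libcudnn.so.8.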
Then I tried this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
and got this:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 16091595871081806106
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4523556864
locality {
bus_id: 1
links {
}
}
incarnation: 12552843679060776178
physical_device_desc: "device: 0, name: P102-100, pci bus id: 0000:02:00.0, compute capability: 6.1"
, name: "/device:GPU:1"
device_type: "GPU"
memory_limit: 4523556864
locality {
bus_id: 1
links {
}
}
incarnation: 10038670458830395131
physical_device_desc: "device: 1, name: P102-100, pci bus id: 0000:03:00.0, compute capability: 6.1"
, name: "/device:GPU:2"
device_type: "GPU"
memory_limit: 4523556864
locality {
bus_id: 1
links {
}
}
incarnation: 5918382531485927936
physical_device_desc: "device: 2, name: P102-100, pci bus id: 0000:04:00.0, compute capability: 6.1"
, name: "/device:GPU:3"
device_type: "GPU"
memory_limit: 4523556864
locality {
bus_id: 1
links {
}
}
incarnation: 12194179101487290626
physical_device_desc: "device: 3, name: P102-100, pci bus id: 0000:05:00.0, compute capability: 6.1"
, name: "/device:GPU:4"
device_type: "GPU"
memory_limit: 4523556864
locality {
bus_id: 1
links {
}
}
incarnation: 4221936322635506021
physical_device_desc: "device: 4, name: P102-100, pci bus id: 0000:06:00.0, compute capability: 6.1"
]
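Solution
Since libraries were missing, I installed a TensorFlow container. I couldn't get it to work properly, so I ended up copying the required libraries from the container into my /usr/lib/x86_64-linux-gnu directory, which seems to have solved my problem. I used this command: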
cp /var/lib/docker/overlay2/7327c1946fb4e673086c1c74f310e4ec3a767101d37640542737a75b7594a847/diff/usr/lib/x86_64-linux-gnu/libcudnn* /usr/lib/x86_64-linux-gnu/
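To confirm that a copy like this actually made the libraries visible to the dynamic loader, one option is to try opening them from Python before importing TensorFlow. This is only a sketch: the helper name `check_libs` is hypothetical, and the list of library names is taken from the "Could not load dynamic library" warnings in the question's log, so it may need adjusting for other TensorFlow or CUDA versions.

```python
import ctypes

# Library names taken from the warnings in the console log (an assumption
# for this setup; other TensorFlow builds may look for different versions).
REQUIRED_LIBS = ["libcudnn.so.8", "libcusolver.so.10"]


def check_libs(names):
    """Map each library name to True if the dynamic loader can open it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # uses the same search path as dlopen()
            results[name] = True
        except OSError:
            results[name] = False
    return results


if __name__ == "__main__":
    for name, ok in check_libs(REQUIRED_LIBS).items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If a library still shows as missing after copying, running ldconfig or checking LD_LIBRARY_PATH may be worth a try, since the loader caches its search results.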