python - How do I install CUDA 10.1 on Ubuntu Linux?
Problem description
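I am trying to get TensorFlow working with CUDA 10.1, but every time I install any driver (any version), it ends up installing CUDA 11, which the TensorFlow build I am using does not support. I have tried installing both the driver and CUDA from the .deb packages. I have also tried installing the latest driver and then installing CUDA 10.1 from the local .run file, telling the installer not to install the driver. That did install CUDA 10.1 in my /usr/local folder, but whenever I run nvidia-smi it still reports CUDA 11.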
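After a lot of research, I have seen it mentioned that the version shown by nvidia-smi is the latest CUDA runtime the driver supports, and that it does not necessarily reflect the CUDA libraries that are actually installed?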
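One way to see the two numbers side by side is to parse what each tool prints. The following is only a minimal sketch: it assumes the usual `CUDA Version: X.Y` field in the nvidia-smi header and the `release X.Y` line printed by `nvcc --version`. The function names and the sample strings are illustrative and not part of the original question.

```python
import re
from typing import Optional


def driver_supported_cuda(nvidia_smi_output: str) -> Optional[str]:
    """Highest CUDA version the driver supports, as shown in the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", nvidia_smi_output)
    return match.group(1) if match else None


def toolkit_cuda(nvcc_output: str) -> Optional[str]:
    """CUDA toolkit version, as shown in the 'release X.Y' line of nvcc --version."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Illustrative sample lines; real output would come from running the tools.
    smi_sample = "| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |"
    nvcc_sample = "Cuda compilation tools, release 10.1, V10.1.243"
    print("driver supports up to:", driver_supported_cuda(smi_sample))
    print("toolkit installed:", toolkit_cuda(nvcc_sample))
```

If the two values differ, that is not a conflict by itself: the driver only needs to support a version at least as new as the toolkit.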
So CUDA 10.1 should be installed (under /usr/local), and I tried running a test command in TensorFlow:
tf.config.list_physical_devices('GPU')
But this produces the following errors:
2020-09-30 17:36:38.765577: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-09-30 17:36:38.765604: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
/home/robbe/Desktop/usiigaci-optimized/venv/lib/python3.7/site-packages/pandas/compat/__init__.py:120: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
warnings.warn(msg)
2020-09-30 17:36:40.493592: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-09-30 17:36:40.522334: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-30 17:36:40.522943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.455GHz coreCount: 6 deviceMemorySize: 3.94GiB deviceMemoryBandwidth: 104.43GiB/s
2020-09-30 17:36:40.523063: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-09-30 17:36:40.583631: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2020-09-30 17:36:40.583961: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcufft.so.10'; dlerror: libcufft.so.10: cannot open shared object file: No such file or directory
2020-09-30 17:36:40.584167: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcurand.so.10'; dlerror: libcurand.so.10: cannot open shared object file: No such file or directory
2020-09-30 17:36:40.584358: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory
2020-09-30 17:36:40.584543: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcusparse.so.10'; dlerror: libcusparse.so.10: cannot open shared object file: No such file or directory
2020-09-30 17:36:40.704140: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2020-09-30 17:36:40.704203: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1753] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
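So clearly it cannot find the right CUDA 10.1 shared libraries, even though they do exist under /usr/local/cuda-10.1. There are also executables under /usr/bin (including the nvidia-smi that shows CUDA 11); I assume these override the 10.1 directory under /usr/local?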
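To check which of these libraries the dynamic loader can actually find, independently of TensorFlow, something like the following sketch may help. It uses the standard-library ctypes module; the library names are copied from the log above, and the helper names are hypothetical.

```python
import ctypes
from typing import Dict, List

# Library names taken from the TensorFlow log above.
CUDA_LIBS: List[str] = [
    "libcudart.so.10.1",
    "libcublas.so.10",
    "libcufft.so.10",
    "libcurand.so.10",
    "libcusolver.so.10",
    "libcusparse.so.10",
    "libcudnn.so.7",
]


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


def report(libs: List[str] = CUDA_LIBS) -> Dict[str, bool]:
    """Map each library name to whether it could be loaded."""
    return {name: can_load(name) for name in libs}


if __name__ == "__main__":
    for name, ok in report().items():
        print(f"{name}: {'found' if ok else 'NOT found'}")
```

A library that exists on disk but is reported as not found usually means its directory is not on the loader's search path.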
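Things I have tried:
- Installing the NVIDIA driver, then installing CUDA together with its bundled driver. When I did this, CUDA actually installed NVIDIA driver 418 and gave me a vague error about a kernel module that could not be unloaded.
- Installing the driver manually via grub rescue (because of the error in the first step), then installing CUDA 10.1 (local .run, without the NVIDIA driver). In other words, installing the NVIDIA driver and CUDA completely separately.
- Installing the latest NVIDIA driver through the GUI: Software -> Additional Drivers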
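Things that worked:
- Installing CUDA with the local .run file and telling it not to include the driver. This successfully installs CUDA 10.1 under /usr/local, but neither TensorFlow nor the nvidia-smi command recognizes it.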
I am at my wits' end. I have concluded that TensorFlow and CUDA are hard to work with, but I need this to work. Can anyone help?
Thank you.
Solution
So I found the solution.
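It was indeed a matter of setting the right environment variables. TensorFlow looks for specific shared object files that live under the CUDA 10.1 installation, so I simply added the cuda include and lib64 paths to LD_LIBRARY_PATH in my ~/.bashrc, as shown below. (Only the lib64 entry matters to the dynamic loader; the include directory holds headers rather than shared libraries, so listing it is harmless but has no effect. This also assumes that /usr/local/cuda points to the cuda-10.1 installation.)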
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/include
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
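To confirm that the exported path really exposes the library TensorFlow asked for, a small check along these lines can be run in a new shell. This is a sketch only; the helper name dirs_containing is hypothetical, and the library name is the one from the error log.

```python
import os
from typing import List


def dirs_containing(lib_name: str, search_path: str) -> List[str]:
    """Return the directories in a colon-separated path that contain lib_name."""
    hits = []
    for directory in search_path.split(":"):
        # Skip empty entries, which appear when the variable starts with ':'.
        if directory and os.path.isfile(os.path.join(directory, lib_name)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    ld_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = dirs_containing("libcudart.so.10.1", ld_path)
    if found:
        print("libcudart.so.10.1 found in:", ", ".join(found))
    else:
        print("libcudart.so.10.1 is not in any LD_LIBRARY_PATH directory")
```

Note that the variable has to be exported before Python starts (for example by opening a new terminal after editing ~/.bashrc), since the loader reads it when the process is launched.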