Deep learning on an Nvidia 1070 Ti with Ubuntu 18.04

Problem description

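I am pulling my hair out at this point. I have spent many hours trying different things to get my card working with Tensorflow.

The most recent attempt (which hit similar issues to earlier ones) was installing the tensorflow docker image: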

https://hub.docker.com/r/tensorflow/tensorflow/

I installed nvidia-docker and ran nvidia-smi, which seems to report that my GPU is present.

Then I ran this command:

nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu

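Once it had downloaded and started, I tried to run the notebooks (starting with the hello tensorflow notebook).

As soon as I try to import tensorflow (using just the default, unmodified notebook), I get a KernelRestart: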

KernelRestarter: restarting kernel (1/5), keep random ports

I'm not sure what the best next step is. I don't know how to troubleshoot a docker container, let alone a jupyter notebook running inside one.
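One way to see why the kernel restarts is to run the import in a separate Python process and look at how that process exits, since Jupyter only reports the restart. The sketch below is not from the original post; `try_import` and `describe` are hypothetical helper names, and it assumes a POSIX system (such as the tensorflow container), where a negative return code from `subprocess` means the child was terminated by a signal.

```python
import signal
import subprocess
import sys


def try_import(module_name):
    """Import module_name in a child interpreter; return (returncode, stderr)."""
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode, result.stderr


def describe(returncode):
    """Turn a subprocess return code into a short human-readable message."""
    if returncode == 0:
        return "import succeeded"
    if returncode < 0:
        # On POSIX, -N means the child was terminated by signal N.
        return "killed by signal " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode
```

Calling `describe(try_import("tensorflow")[0])` from a terminal inside the container would show whether the import raises an ordinary Python error (the traceback is in the returned stderr) or whether the process is killed outright, which is the case a notebook surfaces only as a kernel restart.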

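I ran into similar problems earlier when I tried running locally, without the docker container.

Any good suggestions for next steps? I spent more than I intended on this card and can't figure out how to make it work.

(I believe I can import tensorflow locally on my machine with tensorflow-gpu installed, but when I reach the conv2d part I get "could not create cudnn handle: CUDNN_STATUS_NOT_INITIALIZED", if I recall correctly; it has been a busy few days.)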

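EDIT: Yes to both cuda and cudnn. I installed nvidia-390 without trouble, and the test that looks fine is that nvidia-smi works. I just compiled tf from scratch and it still fails (in this case importing tf doesn't fail, but I get the same not-initialized error, along with a message that it may not be the correct nvidia version; I think it called out nvidia-390.77). I'm considering a fresh 18.04 install with an earlier nvidia-3xx driver. Trying to "downgrade" left me with a broken apt that I spent days trying to repair.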

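EDIT2: I also realized I had installed CUDA 9.0, but with a cudnn 7.1 built for CUDA 9.1 (you can download that combination from nvidia, whatever that means). I'm trying to revert, but I'm having a lot of trouble backing out, and I'm very close to wiping the machine, reinstalling ubuntu, and starting from there. I have all the commands, so it might be easier, but I'm not sure it will solve the problem. (For example, cudnn-9.0-linux-x64-v7.1.)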
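The mismatch described in EDIT2 can be caught before unpacking anything by reading the CUDA version out of the cuDNN archive name. This is only a sketch: the helper names are hypothetical, and the `cudnn-<cuda>-linux-x64-v<cudnn>` pattern is an assumption inferred from the two file names that appear in this post.

```python
import re


def parse_cudnn_archive(name):
    """Return (cuda_version, cudnn_version) from a cuDNN archive name.

    Assumes names shaped like 'cudnn-9.0-linux-x64-v7.1'.
    """
    match = re.match(r"cudnn-(\d+\.\d+)-linux-x64-v(\d+(?:\.\d+)*)", name)
    if match is None:
        raise ValueError("unrecognised cuDNN archive name: " + name)
    return match.group(1), match.group(2)


def matches_toolkit(archive_name, cuda_version):
    """True if the archive was built for the given CUDA toolkit version."""
    built_for, _ = parse_cudnn_archive(archive_name)
    return built_for == cuda_version
```

With CUDA 9.0 installed, `matches_toolkit("cudnn-9.1-linux-x64-v7.1", "9.0")` returns False, while the `cudnn-9.0-...` archives return True.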

EDIT3: Coming back to respond to this. I wrote a gist describing what I had to do to get my GPU working on my host under ubuntu 16.04. I have not tested it in docker. Here is the gist:

https://gist.github.com/onaclov2000/c22fe1456ffa7da6cebd67600003dffb

Copied and pasted here:

# 1070 Ti
Fresh Install 16.04
(download updates, and include 3rd party)
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-384
# Contents
sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf << 'EOF'
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot
# Takes about 30-40 minutes 1.5GB approx
wget https://developer.download.nvidia.com/compute/cuda/9.0/secure/Prod/local_installers/cuda_9.0.176_384.81_linux.run
sudo sh cuda_9.0.176_384.81_linux.run
    No to install nvidia accelerated Graphics Driver for Linux
    yes to Cuda 9.0 toolkit
    default
    yes to symbolic link
    yes to samples
    default location is fine


#Alternately (need to test)
#sudo sh cuda_9.0.176_384.81_linux.run --silent --toolkit --samples

cat >> ~/.bashrc << 'EOF'
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64\
${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery # Assuming make was successful
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/bandwidthTest
make
./bandwidthTest # Assuming make was successful
# Look for Result = PASS

sudo apt-get install nvidia-cuda-toolkit

# Couldn't find on 16.04 maybe this is a 18.04 upgrade?
#sudo apt-get install cuda-toolkit-9.0 cuda-command-line-tools-9-0

# At this point the driver and CUDA are installed, now it's time to install the CUDNN driver/piece.
#This is the link that I have, be sure to use v7 not v7.1 as I haven't had luck in the past with that (though it might work).
https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v7.0.5/prod/9.0_20171129/cudnn-9.0-linux-x64-v7
# 333 MB so will take a bit
cd ~/Downloads
tar -xvf cudnn-9.0-linux-x64-v7.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/

sudo apt-get install git tmux
cd ~/Downloads
# At this point I'm going to install Anaconda
wget https://repo.continuum.io/archive/Anaconda3-4.3.1-Linux-x86_64.sh -O anaconda-install.sh 
bash anaconda-install.sh # Follow Prompts adding path to bash
source ~/.bashrc
conda create --name ml
source activate ml
pip install tensorflow-gpu==1.5

# test the install
cd ~
mkdir projects
cd projects
git clone https://github.com/tensorflow/models




# Additional notes
Run a sample from the cuda samples folder

cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery

Output:

Plenty but ends with the following
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 2
Result = PASS


This tells you which cudnn is installed

cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2

Outputs:
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
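The same check can be scripted. The sketch below reads the three `#define` lines and combines them with the formula shown in the `CUDNN_VERSION` line above. The function name `cudnn_version` is hypothetical, and it assumes the header carries the defines in the layout of the grep output (later cuDNN releases may keep them in a different header).

```python
import re


def cudnn_version(header_text):
    """Return (major, minor, patch, combined) parsed from cudnn.h text."""
    parts = {}
    for key in ("MAJOR", "MINOR", "PATCHLEVEL"):
        match = re.search(
            r"^#define\s+CUDNN_%s\s+(\d+)" % key, header_text, re.MULTILINE
        )
        if match is None:
            raise ValueError("CUDNN_%s not found" % key)
        parts[key] = int(match.group(1))
    # Same formula as the CUDNN_VERSION macro in the header.
    combined = parts["MAJOR"] * 1000 + parts["MINOR"] * 100 + parts["PATCHLEVEL"]
    return parts["MAJOR"], parts["MINOR"], parts["PATCHLEVEL"], combined


# Sample taken from the grep output in this post.
SAMPLE = """#define CUDNN_MAJOR 7
#define CUDNN_MINOR 1
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION    (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)
"""
```

For the sample, `cudnn_version(SAMPLE)` gives `(7, 1, 4, 7104)`. On a real machine the text would come from reading `/usr/local/cuda/include/cudnn.h`.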


# This tells you which CUDA toolkit version is installed

nvcc --version 

Outputs:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
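To compare the toolkit version against what a given tensorflow-gpu build expects, the release can be pulled out of the nvcc banner. A minimal sketch, assuming the banner keeps the "release X.Y, VX.Y.Z" wording shown above; `nvcc_release` is a hypothetical helper, not part of any CUDA tooling.

```python
import re


def nvcc_release(output):
    """Return (release, full_version) parsed from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+), V(\d+\.\d+\.\d+)", output)
    if match is None:
        raise ValueError("no release line found in nvcc output")
    return match.group(1), match.group(2)


# Sample taken from the nvcc output in this post.
NVCC_SAMPLE = """nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176
"""
```

Here `nvcc_release(NVCC_SAMPLE)` returns `("9.0", "9.0.176")`, which matches the cuda_9.0.176 installer used in the gist.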

Lastly, I have since updated to 18.04 without chasing all of this down, so I'll be updating the gist above with an 18.04 version as I move along.

Tags: python, docker, tensorflow, nvidia-docker
