caffe memory error caused by a newer cuda version

Problem description

I created an environment with caffe-gpu and cuda 8:

conda create -n py27Cfe-gpu-p27h03f526a_2
conda install caffe-gpu=1.0=py27h03f526a_2

caffe-gpu                 1.0              py27h03f526a_2   
cudatoolkit               8.0                           3  
cudnn                     6.0.21                cuda8.0_0  
jupyter                   1.0.0                    py27_7  

I got cuda 8 by selecting a specific build in "conda install caffe-gpu".

I also created a caffe-gpu environment with cuda 9:

conda create -n p27cu9Cfegpu
conda install caffe-gpu=1.0=py27heda4471_3

caffe-gpu                 1.0              py27heda4471_3
cudatoolkit               9.0                  h13b8566_0  
cudnn                     7.3.1                 cuda9.0_0
jupyter                   1.0.0                    py27_7

I tested the Google deepdream jupyter notebook with both. The cuda 8 environment runs it without any trouble. The cuda 9 environment fails at this point:

I0505 12:29:44.577164  9839 net.cpp:744] Ignoring source layer loss2/loss
I0505 12:29:44.578850  9839 net.cpp:744] Ignoring source layer loss3/loss3
F0505 12:29:55.785749  9839 syncedmem.cpp:71] Check failed: error == cudaSuccess (2 vs. 0)  out of memory
*** Check failure stack trace: ***
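
For reading logs like the one above, here is a minimal sketch that splits a glog-style line into its fields. The helper name parse_glog_line and the regular expression are illustrative assumptions, not part of caffe. The fatal line reports error code 2 from the CUDA runtime, which is its memory-allocation failure code, matching the "out of memory" text.

```python
import re

# glog line layout: severity letter, month/day, time, thread id, file:line, message.
GLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<thread>\d+) "
    r"(?P<source>[^:\s]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_glog_line(text):
    """Return a dict of fields for one glog line, or None if it does not match."""
    match = GLOG_LINE.match(text.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields


# The fatal line copied from the log above.
sample = ("F0505 12:29:55.785749  9839 syncedmem.cpp:71] "
          "Check failed: error == cudaSuccess (2 vs. 0)  out of memory")
record = parse_glog_line(sample)
print(record["severity"], record["source"], record["line"])
print(record["message"])
```

Filtering on severity "F" is a quick way to find the first fatal check in a long notebook log.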

I tried changing the batch size to 1 in the first parameter of the deploy.prototxt file, as follows:

name: "GoogleNet"
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 3 dim: 224 dim: 224 } }
}
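
To put the batch-size change in perspective, the rough calculation below estimates how much memory the input blob alone occupies. It assumes 4-byte single-precision values; the helper name blob_bytes is hypothetical. It counts only the input blob, not the weights or the intermediate activations, so it is a lower bound rather than a full memory estimate.

```python
from functools import reduce
from operator import mul

BYTES_PER_FLOAT32 = 4  # assumption: blobs hold single-precision floats


def blob_bytes(shape):
    """Bytes needed for one dense blob of the given shape."""
    return reduce(mul, shape, 1) * BYTES_PER_FLOAT32


# Shape taken from the input_param in the prototxt above.
input_shape = (1, 3, 224, 224)
size = blob_bytes(input_shape)
print(size, "bytes =", round(size / 2.0 ** 20, 3), "MiB")
```

At well under 1 MiB, the input blob is small next to the roughly 4 GiB on the card, which fits the observation that shrinking it made no visible difference.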

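But it didn't help. I realize there are many other changes between the two environments; they are listed here.

Other differences between the cuda9 environment and the cuda8 environment:
(lines with a minus are only in the cuda9 env; lines with a plus are only in the cuda8 env)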

-backports_abc             0.5                      py27_0  
+backports_abc             0.5              py27h7b3c97b_0  

-caffe-gpu                 1.0              py27heda4471_3  
+caffe-gpu                 1.0              py27h03f526a_2  

-cudatoolkit               9.0                  h13b8566_0  
-cudnn                     7.3.1                 cuda9.0_0  
-cycler                    0.10.0                   py27_0  
+cudatoolkit               8.0                           3  
+cudnn                     6.0.21                cuda8.0_0  
+cycler                    0.10.0           py27hc7354d3_0  

-h5py                      2.7.1            py27h2697762_0  
+h5py                      2.8.0            py27h39dcb92_0  

-hdf5                      1.10.1               h9caa474_1  
+hdf5                      1.8.18               h6792536_1  

-ipython_genutils          0.2.0            py27h89fb69b_0  
+ipython_genutils          0.2.0                    py27_0  

-libprotobuf               3.5.2                h6f1eeef_0  
+libprotobuf               3.4.1                h5b8497f_0  

+linecache2                1.0.0                    py27_0  

-nbformat                  4.4.0            py27hed7f2b2_0  
+nbformat                  4.4.0                    py27_0  

-opencv                    3.3.1            py27hdcf4849_0  
+opencv                    3.3.1            py27h9bb06ff_1  

-protobuf                  3.5.2            py27hf484d3e_1  
+protobuf                  3.4.1            py27h2ba6a9c_0  

-traitlets                 4.3.2                    py27_0  
-wcwidth                   0.1.7                    py27_0  
+traceback2                1.4.0                    py27_0  
+traitlets                 4.3.2            py27hd6ce930_0  
+unittest2                 1.1.0                    py27_0  
+wcwidth                   0.1.7            py27h9e3e1ab_0  
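
A listing like the one above can be produced by comparing the "conda list" output of the two environments. The sketch below is one way to do that, using only the four rows shown earlier for each environment as sample input; the function names parse_conda_list and diff_envs are illustrative.

```python
def parse_conda_list(text):
    """Map package name -> (version, build) from `conda list` style rows."""
    packages = {}
    for row in text.splitlines():
        parts = row.split()
        # Skip blank rows, comment headers, and rows without a build column.
        if len(parts) < 3 or parts[0].startswith("#"):
            continue
        packages[parts[0]] = (parts[1], parts[2])
    return packages


def diff_envs(left, right):
    """Return sorted (name, left_entry, right_entry) for packages that differ."""
    changes = []
    for name in sorted(set(left) | set(right)):
        if left.get(name) != right.get(name):
            changes.append((name, left.get(name), right.get(name)))
    return changes


cuda9 = parse_conda_list("""
caffe-gpu                 1.0              py27heda4471_3
cudatoolkit               9.0                  h13b8566_0
cudnn                     7.3.1                 cuda9.0_0
jupyter                   1.0.0                    py27_7
""")
cuda8 = parse_conda_list("""
caffe-gpu                 1.0              py27h03f526a_2
cudatoolkit               8.0                           3
cudnn                     6.0.21                cuda8.0_0
jupyter                   1.0.0                    py27_7
""")

for name, in_cuda9, in_cuda8 in diff_envs(cuda9, cuda8):
    print("-", name, in_cuda9)
    print("+", name, in_cuda8)
```

A package present in only one environment shows up with None on the other side.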

In each case another minor error appears when the script runs, so I don't think it is what causes the cuda9 failure:

 Network initialization done.
I0505 12:29:44.542949  9839 upgrade_proto.cpp:53] Attempting to upgrade input file specified using deprecated V1LayerParameter: ./modelZoo/bvlc_googlenet/bvlc_googlenet.caffemodel
I0505 12:29:44.575798  9839 upgrade_proto.cpp:61] Successfully upgraded file specified using deprecated V1LayerParameter

Can anyone shed light on this memory situation? The graphics card is an Nvidia 1050Ti. The system is Ubuntu 18.04 with the latest driver from Nvidia installed.

nvidia-smi
Sun May  5 12:44:44 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39       Driver Version: 418.39       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  On   | 00000000:01:00.0  On |                  N/A |
| 20%   32C    P5    N/A /  75W |    406MiB /  4038MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1746      G   /usr/lib/xorg/Xorg                            26MiB |
|    0      2296      G   /usr/bin/gnome-shell                          48MiB |
|    0      3226      G   /usr/lib/xorg/Xorg                           195MiB |
|    0      3358      G   /usr/bin/gnome-shell                         132MiB |
+-----------------------------------------------------------------------------+
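
The table above shows 406 MiB already in use by the desktop before the notebook starts. As a sketch, the same figures can be read from a script just before the net is loaded, using nvidia-smi's CSV query mode. The function names are hypothetical, and query_gpu_memory only works on a machine where nvidia-smi is on the PATH, so the example call below parses the values from the table instead.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text):
    """Parse 'used, total' rows (MiB) into a list of (used, total, free) tuples."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total, total - used))
    return rows


def query_gpu_memory():
    """Run nvidia-smi and parse its output; needs the NVIDIA driver tools installed."""
    output = subprocess.check_output(QUERY).decode("utf-8")
    return parse_memory_csv(output)


# Values copied from the nvidia-smi table above: 406 MiB used of 4038 MiB.
print(parse_memory_csv("406, 4038"))
```

With these numbers about 3632 MiB remain free, so comparing this reading in both environments right before the failing call would show whether the starting conditions differ.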

Tags: python, caffe

Solution

