Unable to run Keras with GPU on AWS EC2

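Question

I am trying to use a g2.2xlarge EC2 instance to train some simple ML models, but I am not sure whether GPU support is working. I suspect it is not, because the training time is very similar to what I get on my crappy laptop.

I installed TensorFlow GPU support by following these official guides; below is the output of some commands.

Running nvidia-smi in a shell returns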

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 00000000:00:03.0 Off |                  N/A |
| N/A   29C    P8    17W / 125W |      0MiB /  4037MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Running pip list

...
jupyterlab (0.31.5)
jupyterlab-launcher (0.10.2)
Keras (2.2.2)
Keras-Applications (1.0.4)
Keras-Preprocessing (1.0.2)
kiwisolver (1.0.1)
...
tensorboard (1.10.0)
tensorflow (1.10.0)
tensorflow-gpu (1.10.0)
...

I get very similar output by running conda list

The Python version is Python 3.6.4 |Anaconda.

Some other output that will hopefully be useful:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()

2018-08-11 16:42:54.942052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-11 16:42:54.943269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: 
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-08-11 16:42:54.943309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:42:54.943337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:42:54.943355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-08-11 16:42:54.943371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
[]

.

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

2018-08-11 16:44:03.560954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:44:03.561015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:44:03.561035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-08-11 16:44:03.561052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15704826459248001252
]

.

import tensorflow as tf
tf.test.is_gpu_available()

2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:45:22.049748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:45:22.049782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0 
2018-08-11 16:45:22.049814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
False
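
The log lines above already explain the g2.2xlarge result: TensorFlow ignores the GRID K520 because its compute capability (3.0) is below the 3.5 minimum that this build reports. Below is a minimal sketch, assuming the log format shown above, of extracting that information from captured log text. find_ignored_gpus is a hypothetical helper name and is not part of TensorFlow.

```python
import re

# Matches the "Ignoring visible gpu device" line in the format shown above.
_IGNORED = re.compile(
    r"Ignoring visible gpu device \(device: (?P<index>\d+), name: (?P<name>[^,]+),"
    r".*?compute capability: (?P<have>\d+\.\d+)\)"
    r".*?minimum required Cuda capability is (?P<need>\d+\.\d+)"
)


def _version(text):
    """Turn "3.5" into (3, 5) so versions compare correctly."""
    return tuple(int(part) for part in text.split("."))


def find_ignored_gpus(log_text):
    """Return one dict per GPU that the log says TensorFlow is ignoring."""
    ignored = []
    for match in _IGNORED.finditer(log_text):
        ignored.append({
            "index": int(match.group("index")),
            "name": match.group("name").strip(),
            "have": _version(match.group("have")),
            "need": _version(match.group("need")),
        })
    return ignored


# Log line copied from the output above.
sample = (
    "2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] "
    "Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, "
    "compute capability: 3.0) with Cuda compute capability 3.0. "
    "The minimum required Cuda capability is 3.5."
)

for gpu in find_ignored_gpus(sample):
    too_old = gpu["have"] < gpu["need"]
    print(gpu["index"], gpu["name"], gpu["have"], gpu["need"], "too old:", too_old)
```

If the helper returns an entry, the device is visible to the driver but below what the TensorFlow binary accepts, which matches the empty GPU list and the False result above.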

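Can you confirm that Keras is not running on the GPU? Do you have any suggestions on how to finally fix this?

Thanks

Edit:

I tried using a p2.xlarge EC2 instance, but the problem does not seem to be solved. Here are a couple of outputs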

>>> from keras import backend as K
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
2018-08-11 21:54:24.238022: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-11 21:54:24.247402: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-08-11 21:54:24.247430: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (ip-172-31-2-145): /proc/driver/nvidia/version does not exist
[]

.

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 80385595218229545
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6898783310276970136
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 4859092998934769352
physical_device_desc: "device: XLA_CPU device"
]
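
On the p2.xlarge attempt the log points at a different cause: the kernel driver does not appear to be running, because /proc/driver/nvidia/version does not exist. Below is a minimal sketch of checking for that file directly before starting Python-side debugging. The path is taken from the log message above; the function names are hypothetical.

```python
import os

# Path quoted in the TensorFlow diagnostic message above.
NVIDIA_VERSION_FILE = "/proc/driver/nvidia/version"


def nvidia_driver_loaded(version_file=NVIDIA_VERSION_FILE):
    """Return True if the NVIDIA kernel driver's version file exists."""
    return os.path.isfile(version_file)


def read_driver_version(version_file=NVIDIA_VERSION_FILE):
    """Return the file's contents, or None when the file is missing."""
    if not nvidia_driver_loaded(version_file):
        return None
    with open(version_file) as handle:
        return handle.read().strip()


if nvidia_driver_loaded():
    print("NVIDIA kernel driver is loaded:")
    print(read_driver_version())
else:
    print("NVIDIA kernel driver not loaded:", NVIDIA_VERSION_FILE, "is missing")
```

When the file is missing, the problem sits below TensorFlow and Keras (driver installation or AMI choice), so reinstalling Python packages is unlikely to help.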

Tags: tensorflow, amazon-ec2, keras

Solution


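I solved the problem by doing the following:

  • Using a p2.xlarge EC2 instance
  • Choosing Deep Learning AMI (Ubuntu) v12.0 as the launch AMI
  • Using conda env list to see the list of available environments, then activating the one I needed with source activate tensorflow_p36

The last point is probably the thing I never realized I had to do in my earlier tests.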
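
Since the missing step was activating the right environment, it can help to confirm from inside Python which environment is actually in use. Below is a minimal sketch, assuming conda has set the CONDA_DEFAULT_ENV variable on activation. The helper names are hypothetical; tensorflow_p36 is the environment name from the list above.

```python
import os
import sys


def describe_python_environment(environ=None, executable=None):
    """Report the active conda environment name and the interpreter path."""
    environ = os.environ if environ is None else environ
    executable = sys.executable if executable is None else executable
    return {
        # None when no conda environment has been activated in this shell.
        "conda_env": environ.get("CONDA_DEFAULT_ENV"),
        "executable": executable,
    }


def is_expected_env(expected, environ=None):
    """Return True if the active conda environment has the expected name."""
    return describe_python_environment(environ)["conda_env"] == expected


info = describe_python_environment()
print("conda env:  ", info["conda_env"])
print("interpreter:", info["executable"])
if not is_expected_env("tensorflow_p36"):
    print("tensorflow_p36 is not active; the GPU build may not be the one imported")
```

The interpreter path is a useful second check: after activation it would be expected to point inside the environment's own directory rather than the base installation.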

After that, everything worked as expected

>>> from keras import backend as K
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
['/job:localhost/replica:0/task:0/device:GPU:0']
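
Note that in the earlier p2.xlarge attempt the device listing contained an XLA_GPU entry while the Keras call still returned an empty list, so an XLA_GPU entry alone does not indicate a usable GPU. Below is a minimal sketch of a check that counts only devices whose type is exactly "GPU". gpu_device_names and FakeDevice are hypothetical names; the stand-in records copy the names and types from that listing.

```python
from collections import namedtuple


def gpu_device_names(devices):
    """Names of devices whose device_type is exactly "GPU" (excludes "XLA_GPU")."""
    return [device.name for device in devices if device.device_type == "GPU"]


# Stand-in records so the sketch runs without TensorFlow installed.
FakeDevice = namedtuple("FakeDevice", ["name", "device_type"])

broken_listing = [
    FakeDevice("/device:CPU:0", "CPU"),
    FakeDevice("/device:XLA_GPU:0", "XLA_GPU"),
    FakeDevice("/device:XLA_CPU:0", "XLA_CPU"),
]
print(gpu_device_names(broken_listing))

# With TensorFlow 1.x installed, the same helper can be fed the real listing:
# from tensorflow.python.client import device_lib
# print(gpu_device_names(device_lib.list_local_devices()))
```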

In addition, running nvidia-smi shows the GPU resources being used during model training, as does nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER.
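
To sample these numbers from a script instead of reading them by eye, nvidia-smi also has a CSV query mode. Below is a minimal sketch, assuming nvidia-smi is on the PATH and supports --query-gpu with the utilization.gpu, memory.used and power.draw fields. The function names and the sample values in the demonstration are made up for illustration.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,power.draw",
    "--format=csv,noheader,nounits",
]


def _to_float(text):
    """Convert a CSV field to float; return None for values like "[Not Supported]"."""
    try:
        return float(text)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse one CSV line per GPU into a list of dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        util, mem, power = [field.strip() for field in line.split(",")]
        rows.append({
            "utilization_pct": _to_float(util),
            "memory_used_mib": _to_float(mem),
            "power_draw_w": _to_float(power),
        })
    return rows


def sample_gpu_stats():
    """Run nvidia-smi once and return the parsed rows (requires an NVIDIA driver)."""
    output = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_stats(output)


# Illustrative input only; real values come from sample_gpu_stats() on a GPU host.
print(parse_gpu_stats("93, 3500, 110.25\n"))
```

Calling sample_gpu_stats() in a loop while a model trains should show non-zero utilization if the GPU is really being used.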

In my example case, the training time for a single epoch went from 42 seconds to 13 seconds.
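
For reference, the arithmetic behind those two timings, as a small sketch (the variable names are illustrative):

```python
before_seconds = 42.0  # epoch time reported before the fix
after_seconds = 13.0   # epoch time reported after the fix

speedup = before_seconds / after_seconds
time_saved = 1.0 - after_seconds / before_seconds

print("speedup: %.1fx" % speedup)
print("time saved per epoch: %.0f%%" % (time_saved * 100))
```

That works out to roughly a 3.2x speedup, or about 69% less time per epoch.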

