tensorflow - Unable to run Keras with GPU on AWS EC2
Problem description
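I am trying to use a g2.2xlarge EC2 instance to train some simple ML models, but I am not sure whether GPU support is working. I suspect it is not, because training times are very similar to those on my crappy laptop.
I installed TensorFlow GPU support following the official guides; below is the output of some commands.
Running nvidia-smi in a shell returns: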
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 00000000:00:03.0 Off | N/A |
| N/A 29C P8 17W / 125W | 0MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Running pip list gives:
...
jupyterlab (0.31.5)
jupyterlab-launcher (0.10.2)
Keras (2.2.2)
Keras-Applications (1.0.4)
Keras-Preprocessing (1.0.2)
kiwisolver (1.0.1)
...
tensorboard (1.10.0)
tensorflow (1.10.0)
tensorflow-gpu (1.10.0)
...
I get very similar output from running conda list.
The Python version is Python 3.6.4 |Anaconda.
Some other hopefully useful output:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
2018-08-11 16:42:54.942052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-11 16:42:54.943269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-08-11 16:42:54.943309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:42:54.943337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:42:54.943355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:42:54.943371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
[]
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
2018-08-11 16:44:03.560954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:44:03.561015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:44:03.561035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:44:03.561052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15704826459248001252
]
import tensorflow as tf
tf.test.is_gpu_available()
2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:45:22.049748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:45:22.049782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:45:22.049814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
False
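The "Ignoring visible gpu device" line repeated in the logs above states both the compute capability the GPU reports and the minimum this TensorFlow build accepts. A minimal sketch, using only the standard library, of pulling those two numbers out of such a log line. The function name, the regular expression, and the returned keys are illustrative assumptions and are not part of TensorFlow.

```python
import re

# Regex written against the TensorFlow 1.10 log line quoted in this post.
# It is an assumption for illustration; other versions may word it differently.
_IGNORED = re.compile(
    r"Ignoring visible gpu device \(device: (?P<index>\d+), "
    r"name: (?P<name>[^,]+), .*?compute capability: (?P<have>\d+\.\d+)\)"
    r".*?minimum required Cuda capability is (?P<need>\d+\.\d+)"
)


def _as_tuple(version):
    """Turn '3.5' into (3, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def parse_ignored_gpu(line):
    """Return details of a GPU that TensorFlow skipped, or None if no match."""
    match = _IGNORED.search(line)
    if match is None:
        return None
    have, need = match.group("have"), match.group("need")
    return {
        "index": int(match.group("index")),
        "name": match.group("name"),
        "have": have,
        "need": need,
        "too_old": _as_tuple(have) < _as_tuple(need),
    }


# Log line copied from the output above.
LOG_LINE = (
    "2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1455] Ignoring visible gpu device (device: 0, "
    "name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) "
    "with Cuda compute capability 3.0. "
    "The minimum required Cuda capability is 3.5."
)

info = parse_ignored_gpu(LOG_LINE)
print(info)
```

If `too_old` is true, the installed binary will not use that GPU regardless of whether the driver and `nvidia-smi` work.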
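Can you confirm that Keras is not running on the GPU? Do you have any suggestions on how to finally fix this?
Thanks
Edit:
I tried using a p2.xlarge EC2 instance, but the problem does not seem to be solved. Here are a few outputs: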
>>> from keras import backend as K
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
2018-08-11 21:54:24.238022: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-11 21:54:24.247402: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-08-11 21:54:24.247430: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (ip-172-31-2-145): /proc/driver/nvidia/version does not exist
[]
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 80385595218229545
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6898783310276970136
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 4859092998934769352
physical_device_desc: "device: XLA_CPU device"
]
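The diagnostic in the p2.xlarge log above says the kernel driver does not appear to be running because /proc/driver/nvidia/version does not exist. A minimal sketch of checking for that same file before starting Python with TensorFlow. The function names are hypothetical; only the path comes from the log message.

```python
import os

# Path taken from the TensorFlow diagnostic message quoted above.
NVIDIA_VERSION_FILE = "/proc/driver/nvidia/version"


def nvidia_driver_loaded(path=NVIDIA_VERSION_FILE):
    """Return True if the NVIDIA kernel driver's version file exists."""
    return os.path.exists(path)


def read_driver_version(path=NVIDIA_VERSION_FILE):
    """Return the contents of the version file, or None if it is unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    if nvidia_driver_loaded():
        print(read_driver_version())
    else:
        print("NVIDIA kernel driver not loaded: %s is missing" % NVIDIA_VERSION_FILE)
```

When the file is missing, TensorFlow has no driver to talk to, which matches the empty GPU list returned in that session.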
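Solution
I solved the problem by doing the following:
- Use a p2.xlarge EC2 instance
- Choose Deep Learning AMI (Ubuntu) v12.0 as the launch AMI
- Use conda env list to see the list of available environments, then activate the one I need with source activate tensorflow_p36
The last point is probably the thing I never realized I had to do in my previous tests.
After that, everything works as expected: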
>>> from keras import backend as K
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
['/job:localhost/replica:0/task:0/device:GPU:0']
In addition, running nvidia-smi shows GPU resource usage during model training, as does nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER.
In my sample case, training a single epoch went from 42 seconds to 13 seconds.
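To watch GPU usage from a script rather than by reading the nvidia-smi table, the tool's CSV query mode can be parsed. A rough sketch, assuming nvidia-smi is on the PATH; the helper names and the choice of queried fields are my own, and the code sticks to Python 3.6-compatible calls to match the environment above.

```python
import shutil
import subprocess

# nvidia-smi's CSV query mode; the selected fields are an illustrative choice.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def _to_int(field):
    """Convert a numeric field, or return None for values like '[N/A]'."""
    try:
        return int(field)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Parse 'name, util, memory' rows into a list of dictionaries."""
    rows = []
    for line in text.strip().splitlines():
        # Split from the right so the two numeric fields are always last.
        name, util, mem = [part.strip() for part in line.rsplit(",", 2)]
        rows.append({
            "name": name,
            "util_percent": _to_int(util),
            "memory_used_mib": _to_int(mem),
        })
    return rows


def query_gpus():
    """Return one dictionary per GPU, or an empty list if the query fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            QUERY,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Calling `query_gpus()` in a loop while training runs in another shell should show utilization and memory rise above zero when the model is actually on the GPU.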