gcloud ML Engine - Keras not running on GPU

Problem Description

I am new to Google Cloud ML Engine and I am trying to train a Keras-based deep learning algorithm for image classification on gcloud. To configure the GPU on gcloud, I have included 'tensorflow-gpu' in the install_requires of my setup.py. My cloud-gpu.yaml is the following:

trainingInput:
  scaleTier: BASIC_GPU
  runtimeVersion: "1.0"
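
For reference, the relevant part of my setup.py looks roughly like this (the package name and the extra dependencies besides tensorflow-gpu are placeholders):

from setuptools import find_packages, setup

setup(
    name='trainer',            # placeholder package name
    version='0.1',
    packages=find_packages(),
    install_requires=[
        'keras',               # assumed extra dependency
        'h5py',                # assumed extra dependency
        'tensorflow-gpu',      # GPU build of TensorFlow, as described above
    ],
)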

In my code I added:

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

at the very beginning, and

with tf.device('/gpu:0'):

before any Keras code.
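
Concretely, the top of my trainer module looks roughly like this (the Keras model definition and training code are omitted):

import tensorflow as tf

# Log the device each op is placed on, so placement shows up in the job logs
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

with tf.device('/gpu:0'):
    # ... Keras model definition and model.fit(...) follow here ...
    pass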

The result is that gcloud recognizes the GPU but does not use it, as you can see in the cloud training logs below:

INFO    2018-11-18 12:19:59 -0600   master-replica-0        Epoch 1/20
INFO    2018-11-18 12:20:56 -0600   master-replica-0          1/219 [..............................] - ETA: 4:17:12 - loss: 0.8846 - acc: 0.5053 - f1_measure: 0.1043
INFO    2018-11-18 12:21:57 -0600   master-replica-0          2/219 [..............................] - ETA: 3:51:32 - loss: 0.8767 - acc: 0.5018 - f1_measure: 0.1013
INFO    2018-11-18 12:22:59 -0600   master-replica-0          3/219 [..............................] - ETA: 3:46:49 - loss: 0.8634 - acc: 0.5039 - f1_measure: 0.1010
INFO    2018-11-18 12:23:58 -0600   master-replica-0          4/219 [..............................] - ETA: 3:44:59 - loss: 0.8525 - acc: 0.5045 - f1_measure: 0.0991
INFO    2018-11-18 12:24:48 -0600   master-replica-0          5/219 [..............................] - ETA: 3:41:17 - loss: 0.8434 - acc: 0.5031 - f1_measure: 0.0992Sun Nov 18 18:24:48 2018       
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |-------------------------------+----------------------+----------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |===============================+======================+======================|
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | N/A   32C    P0    56W / 149W |  10955MiB / 11441MiB |      0%      Default |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-------------------------------+----------------------+----------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0                                                                                       
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+
INFO    2018-11-18 12:24:48 -0600   master-replica-0        | Processes:                                                       GPU Memory |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |  GPU       PID   Type   Process name                             Usage      |
INFO    2018-11-18 12:24:48 -0600   master-replica-0        |=============================================================================|
INFO    2018-11-18 12:24:48 -0600   master-replica-0        +-----------------------------------------------------------------------------+

Basically, GPU utilization stays at 0% throughout training. How is this possible?

Tags: python, tensorflow, keras, google-cloud-platform, deep-learning

Solution


I'd suggest using standard_gpu in your cloud-gpu.yaml, which gives you the same n1-standard-8 machine plus one K80 GPU:

trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
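
As a quick sanity check inside the trainer, you can also print the devices TensorFlow actually sees; a working GPU setup should list a /device:GPU:0 entry in addition to the CPU:

from tensorflow.python.client import device_lib

# Lists every device visible to this TensorFlow build (CPU plus any GPUs)
print(device_lib.list_local_devices())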

Also, this:

with tf.device('/gpu:0'):

should be

with tf.device('/device:GPU:0'):
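
A minimal sketch of what that looks like around a Keras model (the layers here are arbitrary, and Keras is assumed to be running on the TensorFlow backend):

import tensorflow as tf
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

# Hand Keras a session that logs device placement
K.set_session(tf.Session(config=tf.ConfigProto(log_device_placement=True)))

with tf.device('/device:GPU:0'):
    # Build and compile the model inside the device scope
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(100,)))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])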

I'd also suggest taking a look at this cnn_with_keras.py sample for a better example.
