python - Is it possible to train deep neural networks on Google's AI Platform using preemptible TPUs?

Problem description

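I recently started using Google's AI Platform to train my deep neural network models. Since we are a relatively small research lab, I am trying to train the models on preemptible TPUs and host machines. Unfortunately, I could not find anything in the documentation on how to do this.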

At the moment, I submit the training job with the following shell script:

#!/bin/bash
BUCKET_NAME="training_data"
JOB_NAME="PPI_$(date +"%Y%m%d_%H%M%S")"
JOB_DIR="gs://$BUCKET_NAME/hp_job_dir"
TRAINER_PACKAGE_PATH="./training_job_folder/trainer"
MAIN_TRAINER_MODULE="trainer.train"
HPTUNING_CONFIG="training_job_folder/trainer/hptuning_config.yaml"
RUNTIME_VERSION=2.4
PYTHON_VERSION=3.7
REGION="us-central1"
SCALE_TIER=CUSTOM
MASTER_MACHINE_TYPE=n2-highmem-16

gcloud config set project vocal-unfolding-311510

# Submit the training job with a custom scale tier and the chosen master machine.
gcloud ai-platform jobs submit training "$JOB_NAME" \
  --job-dir "$JOB_DIR" \
  --package-path "$TRAINER_PACKAGE_PATH" \
  --module-name "$MAIN_TRAINER_MODULE" \
  --region "$REGION" \
  --runtime-version="$RUNTIME_VERSION" \
  --python-version="$PYTHON_VERSION" \
  --scale-tier "$SCALE_TIER" \
  --config "$HPTUNING_CONFIG" \
  --master-machine-type "$MASTER_MACHINE_TYPE"

# Follow the job's logs in the terminal.
gcloud ai-platform jobs stream-logs "$JOB_NAME"
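
When trying out different machine settings, it can help to print the assembled command before submitting anything. The following is only a sketch of the same submission written with a bash array. The `DRY_RUN` variable and the `build_submit_cmd` function are hypothetical names introduced here for illustration and are not part of gcloud. It uses only the flags and values from the script above, and it does not add any preemptible setting.

```shell
#!/bin/bash
# Sketch: assemble the submit command in an array and print it instead of
# running it. DRY_RUN and build_submit_cmd are hypothetical helper names.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = actually submit

JOB_NAME="PPI_$(date +"%Y%m%d_%H%M%S")"
JOB_DIR="gs://training_data/hp_job_dir"

# Fill the global array SUBMIT_CMD with the same flags as the original script.
build_submit_cmd() {
  SUBMIT_CMD=(
    gcloud ai-platform jobs submit training "$JOB_NAME"
    --job-dir "$JOB_DIR"
    --package-path "./training_job_folder/trainer"
    --module-name "trainer.train"
    --region "us-central1"
    --runtime-version=2.4
    --python-version=3.7
    --scale-tier CUSTOM
    --config "training_job_folder/trainer/hptuning_config.yaml"
    --master-machine-type n2-highmem-16
  )
}

build_submit_cmd

if [[ "$DRY_RUN" == "1" ]]; then
  # Print a shell-quoted version of the command for inspection.
  printf '%q ' "${SUBMIT_CMD[@]}"
  echo
else
  # Run the command exactly as assembled.
  "${SUBMIT_CMD[@]}"
fi
```

Running it with `DRY_RUN=0` would submit the job. With the default, it only prints the command, so a changed machine type or config path can be checked first.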

I would be very grateful if someone could suggest how to change the script so that it uses only preemptible host machines or TPUs.

Thanks in advance,
Manuel S.

Tags: python, tensorflow, keras, google-cloud-ai
