cloud-ml job cancelled after running for several thousand steps

Question

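I can successfully start training jobs on Google Cloud. However, after running for 30 minutes to an hour and a few thousand steps, they end with an unhelpful error message: "CancelledError: Cancelled".

I am training on about 30K images spread across 16 tfrecord files. When I train on a smaller number of images (about 5K) in a single file, I do not have this problem.

Here are the details. I start the job with the following command: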

gcloud ai-platform jobs submit training my_job_name \
--runtime-version 1.13 \
--job-dir=gs://image-training/my_job_dir \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-east1 --config object_detection/CLOUDgpu.yaml \
--python-version 3.5 \
-- \
--model_dir gs://image-training/my_job_dir \
--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config 

Here is my YAML file:

trainingInput:
  runtimeVersion: "1.13"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

My pipeline config file references the data files like this:

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://image-training/t0423data/train_*_re.tfrecord"
  }
  num_readers:3
  label_map_path: "gs://image-training/PigCount/label_map.pbtxt"
}
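
Since the failure only shows up with the 16-file dataset, one thing worth ruling out is whether the `input_path` wildcard picks up exactly the intended shards. The sketch below checks a pattern against a list of file names using Python's standard `fnmatch`. The shard names are hypothetical (the question does not list the real ones), and `fnmatch` only approximates how Cloud Storage paths are globbed, so treat it as a quick sanity check rather than the behaviour of the input reader itself.

```python
from fnmatch import fnmatch

# The pattern from train_input_reader, without the bucket prefix.
PATTERN = "train_*_re.tfrecord"

# Hypothetical shard names; the real file names are not given in the question.
candidates = ["train_%02d_re.tfrecord" % i for i in range(16)]
# Two names that should NOT match, to show what the pattern excludes.
candidates += ["train_00.tfrecord", "val_00_re.tfrecord"]

# Keep only the names the wildcard would select.
matched = [name for name in candidates if fnmatch(name, PATTERN)]

print(len(matched), "of", len(candidates), "files match", PATTERN)
```

In a real check, the list of candidates would come from a listing of the bucket rather than from generated names.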

Finally, the full error:

The replica worker 6 exited with a non-zero status of 1. Termination reason:
Error. Traceback (most recent call last): [...] saving_listeners) File
"/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py",
line 1407, in _train_with_estimator_spec _, loss =
mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 676, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1171, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1270, in run raise six.reraise(*original_exc_info) File
"/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise
value File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1255, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1327, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1091, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 929, in run run_metadata_ptr) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1152, in _run feed_dict_tensor, options, run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1328, in _do_run run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1348, in _do_call raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled To find out
more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=226138759195&resource=ml_job%2Fjob_id%2Ft_05_01_big_data1&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22t_05_01_big_data1%22

In the logs, replica 6 shows the following error:

command '['python3', '-m', 'object_detection.model_main', '--model_dir', 'gs://image-training/my_job_dir', '--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config', '--job-dir', 'gs://image-training/my_job_dir']' returned non-zero exit status 1

And just before that:

worker-replica-6
Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call return fn(*args) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.CancelledError: Cancelled During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main "__main__", mod_spec) File "/usr/lib/python3.5/runpy.py", line 85, in _run_code exec(code, run_globals) File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 109, in <module> tf.app.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 105, in main tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0]) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate return executor.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 638, in run getattr(self, task_to_run)() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 648, in run_worker return self._start_distributed_training() File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 789, in _start_distributed_training saving_listeners=saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train loss = self._train_model(input_fn, hooks, 
saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model return self._train_model_default(input_fn, hooks, saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default saving_listeners) File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run raise six.reraise(*original_exc_info) File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise value File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run run_metadata_ptr) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", 
line 1348, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.CancelledError: Cancelled
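
The log viewer flattens each traceback onto one line, which makes the frames hard to follow. A minimal sketch for splitting such a string back into one frame per line is shown here; `split_traceback` is a hypothetical helper written for this post, and the sample string is a shortened copy of the first two frames above. It only breaks on frame markers and traceback headers, so the final exception text stays attached to the last frame.

```python
import re

def split_traceback(flat):
    """Break a single-line traceback into one entry per frame."""
    # Start a new indented line before each 'File "' frame marker.
    text = re.sub(r'\s+(?=File ")', "\n  ", flat)
    # Start a new line before any chained 'Traceback (...)' header.
    text = re.sub(r'\s+(?=Traceback \(most recent call last\):)', "\n", text)
    return text.splitlines()

# Shortened sample taken from the worker-replica-6 log entry.
sample = (
    'Traceback (most recent call last): '
    'File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", '
    'line 1334, in _do_call return fn(*args) '
    'File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", '
    'line 1319, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata)'
)

lines = split_traceback(sample)
for line in lines:
    print(line)
```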

Any idea how to keep these jobs from failing?

Tags: google-cloud-ml

Answer


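I seem to have solved this by increasing the number and power of the machines I'm using. I changed my YAML file to the following, and it ran for 50,000 steps without any problems. It is more expensive, but at least it works: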

trainingInput:
  scaleTier: CUSTOM
  # Configure a master worker with 4 K80 GPUs
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 9 workers, each with 4 K80 GPUs
  workerCount: 9
  workerType: n1-highcpu-16
  workerConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 3 parameter servers with no GPUs
  parameterServerCount: 3
  parameterServerType: n1-highmem-8
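
To make the size of the change concrete: the worker and parameter-server counts are the same in both configs, and what differs is the machine type and the number of GPUs attached to each machine. The rough tally below compares the two layouts. It assumes that the legacy `standard_gpu` machine type comes with a single K80 GPU; the question itself does not state this, so check it against the machine-type table before relying on the numbers.

```python
# Rough GPU tally for the two cluster layouts in this thread.
# Assumption: "standard_gpu" is the legacy machine type with one K80 GPU.

def total_gpus(master_gpus, worker_count, gpus_per_worker):
    """Return the GPU count across the master and all workers.

    Parameter servers are left out because neither config gives them GPUs.
    """
    return master_gpus + worker_count * gpus_per_worker

# Original config: standard_gpu master plus 9 standard_gpu workers.
original = total_gpus(master_gpus=1, worker_count=9, gpus_per_worker=1)

# Revised config: master and 9 workers, each with 4 K80s attached.
revised = total_gpus(master_gpus=4, worker_count=9, gpus_per_worker=4)

print("original GPUs:", original)
print("revised GPUs:", revised)
print("ratio:", revised // original)
```

Under that assumption the revised cluster has four times as many GPUs, which fits the remark that it costs more.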

For the full instructions, see this page: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus

