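google-cloud-ml - Cloud ML jobs are cancelled after running for several thousand steps
Problem description
I can successfully start training jobs on Google Cloud. However, after running for 30 minutes to 1 hour and a few thousand steps, they end with an uninformative error message: "CancelledError: Cancelled".
I am training on about 30K images spread across 16 tfrecord files. I did not have this problem when training on a smaller number of images (about 5K) in a single file.
Here are the details. I start the job with the following command: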
gcloud ai-platform jobs submit training my_job_name \
--runtime-version 1.13 \
--job-dir=gs://image-training/my_job_dir \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,dist/pycocotools-2.0.tar.gz \
--module-name object_detection.model_main \
--region us-east1 --config object_detection/CLOUDgpu.yaml \
--python-version 3.5 \
-- \
--model_dir gs://image-training/my_job_dir \
--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config
Here is my YAML file:
trainingInput:
  runtimeVersion: "1.13"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
My config file references the data files like this:
train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://image-training/t0423data/train_*_re.tfrecord"
  }
  num_readers: 3
  label_map_path: "gs://image-training/PigCount/label_map.pbtxt"
}
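Since the failure only shows up with the 16-file dataset, one cheap sanity check is to confirm that the wildcard in input_path selects exactly the shards I expect. The sketch below is only an illustration: the file names in the listing are hypothetical, and Python's fnmatch is used as a stand-in for the glob matching that TensorFlow performs on the bucket, applied to base names only.

```python
import fnmatch

# Base-name part of the input_path pattern from the config above.
PATTERN = "train_*_re.tfrecord"


def matching_shards(filenames, pattern=PATTERN):
    """Return the sorted file names that the wildcard pattern would select."""
    # fnmatchcase is case-sensitive on every platform, like object names in a bucket.
    return sorted(name for name in filenames if fnmatch.fnmatchcase(name, pattern))


# Hypothetical bucket listing: 16 training shards plus a few unrelated files.
listing = ["train_%02d_re.tfrecord" % i for i in range(16)]
listing += ["train_03.tfrecord", "val_00_re.tfrecord", "label_map.pbtxt"]

shards = matching_shards(listing)
print(len(shards), "shards matched")
```

In practice the listing would come from the real bucket contents (for example, the output of gsutil ls) rather than a hard-coded list.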
Finally, the full error:
The replica worker 6 exited with a non-zero status of 1. Termination reason:
Error. Traceback (most recent call last): [...] saving_listeners) File
"/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py",
line 1407, in _train_with_estimator_spec _, loss =
mon_sess.run([estimator_spec.train_op, estimator_spec.loss]) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 676, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1171, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1270, in run raise six.reraise(*original_exc_info) File
"/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise raise
value File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1255, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1327, in run run_metadata=run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py",
line 1091, in run return self._sess.run(*args, **kwargs) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 929, in run run_metadata_ptr) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1152, in _run feed_dict_tensor, options, run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1328, in _do_run run_metadata) File
"/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py",
line 1348, in _do_call raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled To find out
more about why your job exited please check the logs:
https://console.cloud.google.com/logs/viewer?project=226138759195&resource=ml_job%2Fjob_id%2Ft_05_01_big_data1&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22t_05_01_big_data1%22
In the logs, replica 6 shows the following error:
command '['python3', '-m', 'object_detection.model_main', '--model_dir', 'gs://image-training/my_job_dir', '--pipeline_config_path=gs://image-training/ssd_inception_v2_coco_2018_01_28/ssd_inception_v2_CLOUD.config', '--job-dir', 'gs://image-training/my_job_dir']' returned non-zero exit status 1
Just before that:
worker-replica-6
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 109, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/root/.local/lib/python3.5/site-packages/object_detection/model_main.py", line 105, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 638, in run
    getattr(self, task_to_run)()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 648, in run_worker
    return self._start_distributed_training()
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/training.py", line 789, in _start_distributed_training
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.CancelledError: Cancelled
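Any idea how to prevent these jobs from failing?
Solution
I seem to have solved this by increasing the number and power of the machines I was using. I changed my YAML file to the following, and the job ran for 50,000 steps without any problems. It is more expensive, but at least it works: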
trainingInput:
  scaleTier: CUSTOM
  # Configure a master worker with 4 K80 GPUs
  masterType: n1-highcpu-16
  masterConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 9 workers, each with 4 K80 GPUs
  workerCount: 9
  workerType: n1-highcpu-16
  workerConfig:
    acceleratorConfig:
      count: 4
      type: NVIDIA_TESLA_K80
  # Configure 3 parameter servers with no GPUs
  parameterServerCount: 3
  parameterServerType: n1-highmem-8
For a full explanation, see this page: https://cloud.google.com/ml-engine/docs/tensorflow/using-gpus
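To put the change in perspective, here is a rough tally of the GPUs in each cluster. This is only back-of-the-envelope arithmetic, and it assumes that the legacy standard_gpu machine type carries a single K80, which is how I understand that tier; the function name total_gpus is just for illustration.

```python
def total_gpus(gpus_per_machine, worker_count):
    """Count GPUs across the master plus all workers.

    Parameter servers are left out because neither config gives them GPUs.
    """
    return gpus_per_machine * (1 + worker_count)


# Original config: master and 9 workers on standard_gpu (assumed 1 K80 each).
before = total_gpus(gpus_per_machine=1, worker_count=9)

# New config: master and 9 workers with acceleratorConfig count: 4.
after = total_gpus(gpus_per_machine=4, worker_count=9)

print("GPUs before:", before)
print("GPUs after:", after)
```

The replica counts are unchanged (1 master, 9 workers, 3 parameter servers), so the extra cost comes from the larger machine types and the additional GPUs per machine.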