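python - TensorFlow Cloud ML object detection - errors with distributed training
Question
I'm following TensorFlow's object detection tutorial to run distributed training of my own model, using exactly the same code as in the repository.
I made a few changes to the tutorial, specifically using runtime 1.5 instead of the 1.2 stated in the tutorial. When I run the job on Google Cloud ML, there are no explicit errors (that I can see), but the job exits quickly without doing any training.
This is the command I use to start the training job: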
gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://test-bucket/training/ \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ./config.yaml \
    -- \
    --train_dir=gs://test-bucket/data/ \
    --pipeline_config_path=gs://test-bucket/configs/ssd_inception_v2_coco.config
This is my config.yaml:
trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_l
  workerCount: 9
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: large_model
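For orientation: this configuration requests one master, nine workers, and three parameter servers. On Cloud ML Engine each replica learns its role from the TF_CONFIG environment variable, a JSON document with "cluster" and "task" entries. The following is a minimal sketch of reading it, not code from the tutorial; the helper name describe_replica and the host names in the example are hypothetical, and treating the master as the chief is an assumption about this setup.

```python
import json
import os


def describe_replica(tf_config_json):
    """Summarize a replica's role from a TF_CONFIG-style JSON string.

    Hypothetical helper for illustration; falls back to a single local
    master when the variable is empty or missing.
    """
    config = json.loads(tf_config_json or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    job_type = task.get("type", "master")
    index = int(task.get("index", 0))
    return {
        "job": job_type,
        "index": index,
        # Assumption: the master replica acts as the chief.
        "is_chief": job_type == "master",
        "replica_counts": {name: len(hosts) for name, hosts in cluster.items()},
    }


if __name__ == "__main__":
    # Describe the replica this process is running as (local default if unset).
    print(describe_replica(os.environ.get("TF_CONFIG", "")))

    # A made-up TF_CONFIG matching the replica counts in config.yaml.
    example = json.dumps({
        "cluster": {
            "master": ["master-host:2222"],
            "worker": ["worker-host-%d:2222" % i for i in range(9)],
            "ps": ["ps-host-%d:2222" % i for i in range(3)],
        },
        "task": {"type": "worker", "index": 6},
    })
    print(describe_replica(example))
```

The "job" and "index" values correspond to the /job:.../task:... device names that appear in the job log.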
Finally, this is how my job log ends:
I worker-replica-6 Clean up finished. worker-replica-6
I worker-replica-7 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-7
I worker-replica-7 Module completed; cleaning up. worker-replica-7
I worker-replica-7 Clean up finished. worker-replica-7
I worker-replica-8 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-8
I worker-replica-8 Module completed; cleaning up. worker-replica-8
I worker-replica-8 Clean up finished. worker-replica-8
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-1 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-1
I worker-replica-1 Module completed; cleaning up. worker-replica-1
I worker-replica-1 Clean up finished. worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-3
I worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-0
I worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-2
I worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-5
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I worker-replica-3 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-3
I worker-replica-0 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-0
I worker-replica-2 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-2
I worker-replica-5 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-5
I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7
I worker-replica-8 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-8
I worker-replica-6 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-6
I Finished tearing down TensorFlow.
I Job failed.
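As I mentioned, I can't get anything useful out of the logs. A bit further back I did get the error "Master init: Unavailable: Stream removed", but I'm not sure what to do about it. Thanks for a push in the right direction!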
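When a log is dominated by repeated "CreateSession still waiting" lines, it can help to count which replicas are waiting on which task. The following is a minimal sketch that assumes the log has been saved as plain text in the same one-line-per-entry form as the excerpt above; the function name summarize_waiting is hypothetical and not part of any Google Cloud tooling.

```python
import re
from collections import Counter

# Matches lines such as:
# I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1
WAITING = re.compile(
    r"^I (?P<replica>\S+) CreateSession still waiting for response "
    r"from worker: (?P<target>\S+)"
)


def summarize_waiting(log_text):
    """Count (waiting replica, awaited task) pairs in a plain-text log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = WAITING.match(line.strip())
        if match:
            counts[(match.group("replica"), match.group("target"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the excerpt in the question.
    sample = "\n".join([
        "I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1",
        "I worker-replica-7 Signal 15 (SIGTERM) was caught. Terminated by service. This is normal behavior. worker-replica-7",
        "I worker-replica-1 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-1",
        "I worker-replica-7 CreateSession still waiting for response from worker: /job:master/replica:0/task:0 worker-replica-7",
    ])
    for (replica, target), n in sorted(summarize_waiting(sample).items()):
        print(replica, "waited on", target, n, "time(s)")
```

In the excerpt every waiting line points at /job:master/replica:0/task:0, which suggests the workers never managed to create a session against the master.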
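Answer
I reproduced your problem and fixed it by following this:
roysheffi commented on this issue 3 months ago: Hi @pkulzc, I think I may have a lead:
On line 357, object_detection/trainer.py calls tf.contrib.slim.learning.train(), which uses the deprecated tf.train.Supervisor and should be migrated to tf.train.MonitoredTrainingSession, as stated in the tf.train.Supervisor documentation.
This has already been requested in tensorflow/tensorflow#15793 and was reported, in the last comment on yahoo/TensorFlowOnSpark#245, as the solution to tensorflow/tensorflow#17852. [1]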
So, in the end, I did this in trainer.py:
- Use tf.train.MonitoredTrainingSession( in place of slim.learning.train(
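The two calls are not interchangeable as written: slim.learning.train() runs the whole training loop itself, while tf.train.MonitoredTrainingSession is a session context manager, so the loop has to be written explicitly. The following is a minimal sketch of that pattern, not the actual trainer.py change. It assumes a TensorFlow install that provides tensorflow.compat.v1 (on runtime 1.5 the same names live directly under tf.train), and the toy one-variable model and the function name run_training are hypothetical stand-ins for the object detection graph.

```python
import tensorflow.compat.v1 as tf


def run_training(master="", is_chief=True, train_dir=None, num_steps=5):
    """Train a toy model with MonitoredTrainingSession; returns the last step read."""
    step = 0
    with tf.Graph().as_default():
        global_step = tf.train.get_or_create_global_step()

        # Hypothetical toy model standing in for the detection model's loss.
        w = tf.get_variable("w", shape=[], initializer=tf.zeros_initializer())
        loss = tf.square(w - 3.0)
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)

        # Stop once the global step reaches num_steps.
        hooks = [tf.train.StopAtStepHook(last_step=num_steps)]

        # master and is_chief are what a distributed setup would supply;
        # the defaults here run a single local process.
        with tf.train.MonitoredTrainingSession(
                master=master,
                is_chief=is_chief,
                checkpoint_dir=train_dir,
                hooks=hooks) as sess:
            while not sess.should_stop():
                _, step = sess.run([train_op, global_step])
    return step


if __name__ == "__main__":
    print("last global step read:", run_training())
```

In a distributed job, master would be the target of the replica's own server and is_chief would be true only on the chief replica; how trainer.py derives those values is not shown in the answer, so they are left as plain parameters here.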