tensorflow - How to fix a TensorFlow distributed training error
Problem description
I wrote my distributed training code by following mnist_replica.py, but the program fails at runtime. It always raises RuntimeError: Init operations did not make model ready. How can I fix this? The failure seems to happen when prepare_or_wait_for_session is called. The code is as follows:
if FLAGS.sync_replicas:
    if FLAGS.replicas_to_aggregate is None:
        replicas_to_aggregate = self.num_workers
    else:
        replicas_to_aggregate = FLAGS.replicas_to_aggregate
    self.opt = tf.train.SyncReplicasOptimizer(
        self.opt,
        replicas_to_aggregate=replicas_to_aggregate,
        total_num_replicas=self.num_workers,
        name="sync")
self.optimizer = self.opt.minimize(self.loss, global_step=self.global_step)

# init
if FLAGS.sync_replicas:
    self.local_init_op = self.opt.local_step_init_op
    if self.is_chief:
        self.local_init_op = self.opt.chief_init_op
    self.ready_for_local_init_op = self.opt.ready_for_local_init_op
    self.chief_queue_runner = self.opt.get_chief_queue_runner()
    self.sync_init_op = self.opt.get_init_tokens_op()
self.global_var_init_op = tf.global_variables_initializer()
self.train_auc_value, self.train_auc_op = tf.metrics.auc(
    self.label, self.out, name="train_auc" + str(FLAGS.task_index))
self.valid_auc_value, self.valid_auc_op = tf.metrics.auc(
    self.label, self.out, name="valid_auc" + str(FLAGS.task_index))
self.local_var_init_op = tf.local_variables_initializer()

if FLAGS.sync_replicas:
    sv = tf.train.Supervisor(
        is_chief=is_chief,
        logdir=train_dir,
        init_op=deepfm.global_var_init_op,
        local_init_op=deepfm.local_init_op,
        ready_for_local_init_op=deepfm.ready_for_local_init_op,
        recovery_wait_secs=1,
        global_step=deepfm.global_step)
sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)
Solution
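The ready check behind prepare_or_wait_for_session fails whenever some variable in the graph is left uninitialized. One likely culprit in the code above: tf.metrics.auc registers its counters in the LOCAL_VARIABLES collection, but when FLAGS.sync_replicas is set, local_init_op is replaced by the optimizer's local_step_init_op / chief_init_op, which do not initialize those metric variables, and local_var_init_op is never passed to the Supervisor. The sketch below (my own assumed setup, not the asker's full graph) demonstrates the ready-check behaviour:

```python
# Sketch: tf.metrics.auc stores its counters (true_positives, ...) as local
# variables, so they are only initialized by tf.local_variables_initializer().
# If they are never initialized, the Supervisor's ready check stays non-empty
# and raises "Init operations did not make model ready".
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()

g = tf1.Graph()
with g.as_default():
    labels = tf1.placeholder(tf.float32, [None])
    preds = tf1.placeholder(tf.float32, [None])
    auc_value, auc_op = tf1.metrics.auc(labels, preds)  # creates local variables

    global_init = tf1.global_variables_initializer()
    metric_init = tf1.local_variables_initializer()
    # report_uninitialized_variables checks global + local variables by default;
    # the Supervisor requires this list to be empty before training can start.
    ready_op = tf1.report_uninitialized_variables()

with tf1.Session(graph=g) as sess:
    sess.run(global_init)
    still_missing = sess.run(ready_op)  # AUC counters are still listed here
    sess.run(metric_init)               # stands in for the missing local init
    all_ready = sess.run(ready_op)      # now empty: the ready check passes
```

If this is the cause, one hedged fix is to fold the metric initializer into the op handed to the Supervisor, e.g. local_init_op=tf.group(deepfm.local_init_op, deepfm.local_var_init_op). Also note that in mnist_replica.py the chief runs sync_init_op and starts the chief_queue_runner after the session is created; if those steps are skipped, synchronous training will stall as well.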