How to fix a TensorFlow distributed training error

Problem description

I wrote my distributed training code by following mnist_replica.py, but when I run the program it fails. It always reports RuntimeError: Init operations did not make model ready. How can I fix this? It seems to fail when it calls prepare_or_wait_for_session. The code is as follows:

        # Inside the model class: wrap the optimizer for synchronous replicas.
        if FLAGS.sync_replicas:
            if FLAGS.replicas_to_aggregate is None:
                replicas_to_aggregate = self.num_workers
            else:
                replicas_to_aggregate = FLAGS.replicas_to_aggregate
            self.opt = tf.train.SyncReplicasOptimizer(
                self.opt,
                replicas_to_aggregate=replicas_to_aggregate,
                total_num_replicas=self.num_workers,
                name="sync")

        self.optimizer = self.opt.minimize(self.loss, global_step=self.global_step)

        # Init ops.
        if FLAGS.sync_replicas:
            self.local_init_op = self.opt.local_step_init_op
            if self.is_chief:
                self.local_init_op = self.opt.chief_init_op

            self.ready_for_local_init_op = self.opt.ready_for_local_init_op
            self.chief_queue_runner = self.opt.get_chief_queue_runner()
            self.sync_init_op = self.opt.get_init_tokens_op()

        self.global_var_init_op = tf.global_variables_initializer()
        self.train_auc_value, self.train_auc_op = tf.metrics.auc(
            self.label, self.out, name="train_auc" + str(FLAGS.task_index))
        self.valid_auc_value, self.valid_auc_op = tf.metrics.auc(
            self.label, self.out, name="valid_auc" + str(FLAGS.task_index))
        self.local_var_init_op = tf.local_variables_initializer()

    # In the training script: create the Supervisor and the session.
    if FLAGS.sync_replicas:
        sv = tf.train.Supervisor(
            is_chief=is_chief,
            logdir=train_dir,
            init_op=deepfm.global_var_init_op,
            local_init_op=deepfm.local_init_op,
            ready_for_local_init_op=deepfm.ready_for_local_init_op,
            recovery_wait_secs=1,
            global_step=deepfm.global_step)

    sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)
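
For reference, mnist_replica.py runs two extra steps on the chief right after the session is created: it executes the init-tokens op and starts the chief queue runner. The sketch below reuses the names from the snippet above (deepfm, sv, is_chief, server, sess_config, FLAGS come from the question); the tf.group that folds local_var_init_op into the Supervisor's local_init_op is an assumption on my part, motivated by the fact that tf.metrics.auc creates local variables that opt.local_step_init_op / opt.chief_init_op do not initialize, which can keep the ready check from passing. It is a sketch, not a confirmed fix.

    # Assumption: also run the local-variable initializer (for the AUC metric
    # variables) together with the SyncReplicasOptimizer's local step init.
    local_init_op = tf.group(deepfm.local_init_op, deepfm.local_var_init_op)

    sv = tf.train.Supervisor(
        is_chief=is_chief,
        logdir=train_dir,
        init_op=deepfm.global_var_init_op,
        local_init_op=local_init_op,
        ready_for_local_init_op=deepfm.ready_for_local_init_op,
        recovery_wait_secs=1,
        global_step=deepfm.global_step)

    sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)

    # As in mnist_replica.py: only the chief runs the init-tokens op and
    # starts the chief queue runner that aggregates the synced gradients.
    if is_chief and FLAGS.sync_replicas:
        sess.run(deepfm.sync_init_op)
        sv.start_queue_runners(sess, [deepfm.chief_queue_runner])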

Tags: tensorflow

Solution

