tensorflow - Restoring a TensorFlow model for inference fails with a dataset iterator initialization error
Problem description
I restore the trained model as follows:
import tensorflow as tf
saver = tf.train.import_meta_graph('expr1.multi/train_logs/model.ckpt-44.meta')
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
saver.restore(sess, 'expr1.multi/train_logs/model.ckpt-44')
graph = tf.get_default_graph()
Then I retrieve the tensors needed for inference:
logits = graph.get_tensor_by_name('strided_slice_1:0')
logits_len = graph.get_tensor_by_name('strided_slice_2:0')
targets = graph.get_tensor_by_name('evaluate/IteratorGetNext:2')
targets_len = graph.get_tensor_by_name('evaluate/IteratorGetNext:3')
# this is to retrieve the dataset.iterator.initializer operator
init_op = graph.get_operation_by_name('evaluate/MakeIterator')
Then I run inference:
sess.run(init_op)
while True:
    try:
        l, ll, t, tl = sess.run([logits, logits_len, targets, targets_len])
        ...
    except tf.errors.OutOfRangeError:
        break  # the dataset is exhausted
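If the model was trained on a single GPU, the restore-and-inference code above works without any problem. However, when the model was trained with the following multi-GPU (asynchronous) implementation, it fails: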
loss_ops = []
train_ops = []
for gpu_i in range(self.num_gpus):
    with tf.device("/gpu:%d" % gpu_i):
        loss = ...
        # update the model parameter
        update = self.model.update(loss, global_step, self.lrate, self.grad_clip)
        loss_ops.append(loss)
        train_ops.append(update)
# within this Dataset pipe is created for evaluation data
evaluator = get_evaluator(self.evaluator)(self.conf, self.model)
mon_sess = tf.train.MonitoredTrainingSession(config=config, hooks=hooks)
...
def train_helper(train_op, loss_op):
    ...
    while not mon_sess.should_stop() and has_data:
        try:
            _, lossVal, step = mon_sess.run([train_op, loss_op, global_step])
        except tf.errors.OutOfRangeError:
            has_data = False
            continue
        # validation loss evaluation every eval_steps steps
        if not step == 0 and step % self.eval_steps == 0:
            # one shot iterator initialization done inside evaluate function
            val_lossVal, num_utts = evaluator.evaluate(mon_sess)
        ...
train_threads = []
for t_op, loss_op in zip(train_ops, loss_ops):
    train_threads.append(threading.Thread(target=train_helper, args=(t_op, loss_op)))
# Start the threads, and block on their completion.
for t in train_threads:
    t.start()
for t in train_threads:
    t.join()
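Inference then fails with the following dataset pipeline iterator initialization error:
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
I cannot figure out why.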
Solution
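No confirmed answer is recorded here. As a starting point for diagnosis, the sketch below lists the iterator-related ops in a restored graph and walks backwards from a tensor to find which IteratorGetNext ops it actually depends on. The helper names list_iterator_ops and upstream_get_next_ops are hypothetical, not part of TensorFlow. They rely only on the .name, .type, .inputs and .control_inputs attributes of graph operations, and they assume the TensorFlow 1.x op type names given in ITERATOR_OP_TYPES.

```python
from collections import deque

# Op type names assumed for tf.data iterators in TensorFlow 1.x graphs.
ITERATOR_OP_TYPES = {"Iterator", "IteratorV2", "OneShotIterator",
                     "MakeIterator", "IteratorGetNext"}


def list_iterator_ops(graph):
    """Return (name, type) for every iterator-related op in the graph."""
    return [(op.name, op.type) for op in graph.get_operations()
            if op.type in ITERATOR_OP_TYPES]


def upstream_get_next_ops(tensor):
    """Walk backwards from a tensor and collect the names of every
    IteratorGetNext op it depends on (data or control dependency)."""
    seen, found = set(), []
    queue = deque([tensor.op])
    while queue:
        op = queue.popleft()
        if op.name in seen:
            continue
        seen.add(op.name)
        if op.type == "IteratorGetNext":
            found.append(op.name)
        # Follow data inputs (tensors) back to their producing ops.
        queue.extend(t.op for t in op.inputs)
        # Follow control dependencies (already operations).
        queue.extend(op.control_inputs)
    return sorted(found)
```

With the restored graph from the question, this could be called as list_iterator_ops(tf.get_default_graph()) and upstream_get_next_ops(graph.get_tensor_by_name('strided_slice_1:0')). If the second call reports an IteratorGetNext outside the evaluate/ scope, then running evaluate/MakeIterator alone would not initialize the iterator that logits reads from, which would be consistent with the error above. That is a hypothesis to check, not a confirmed diagnosis.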