How do I resolve the "OutOfRangeError: End of sequence" error when training XLNet (or BERT) on a TPU?

Problem description

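I hit an OutOfRangeError while running the XLNet model on a TPU. I searched for similar issues and found suggestions that the error can be caused by insufficient compute resources such as memory. However, after I resized my Google Cloud VM instance to a higher-performance tier, the error still occurs. How can I resolve it?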

The error traceback is shown below:

I0621 14:16:40.713411 140181362325248 tpu_estimator.py:463] Starting infeed thread controller.
I0621 14:16:40.713979 140181327095552 tpu_estimator.py:482] Starting outfeed thread controller.
I0621 14:16:42.156764 140182961767872 tpu_estimator.py:536] Enqueue next (600) batch(es) of data to infeed.
I0621 14:16:42.157500 140182961767872 tpu_estimator.py:540] Dequeue next (600) batch(es) of data from outfeed.
I0621 14:16:51.020121 140181362325248 error_handling.py:70] Error recorded from infeed: End of sequence
[[node input_pipeline_task0/while/IteratorGetNext (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
File "run_classifier.py", line 903, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_classifier.py", line 767, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2547, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1167, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1271, in _invoke_input_fn_and_record_structure
wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2945, in _wrap_computation_in_while_loop
parallel_iterations=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2934, in computation
with ops.control_dependencies(op_fn()):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 859, in enqueue_ops_fn
features, labels = inputs.features_and_labels() # Calls get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 3127, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 414, in get_next
output_shapes=self._structure._flat_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1685, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

OutOfRangeError (see above for traceback): End of sequence
[[node input_pipeline_task0/while/IteratorGetNext (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]

I0621 16:07:06.619868 139957928773056 monitored_session.py:1180] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed

Tags: python, tensorflow, nlp

Solution
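One reading of the traceback, offered as an assumption rather than a confirmed diagnosis: the failing op is IteratorGetNext in the infeed, and "End of sequence" is what a tf.data iterator reports when it has no elements left. The log shows the infeed loop requesting 600 batches at a time, so a training dataset that is read only once and holds fewer batches than that would run out partway through the loop, regardless of how much memory the VM has. The pure-Python sketch below mimics that situation. The names `make_batches` and `run_loop` and the sizes (1000 examples, batch size 8) are hypothetical, and no TensorFlow API is used.

```python
import itertools

ITERATIONS_PER_LOOP = 600  # mirrors "Enqueue next (600) batch(es)" in the log


def make_batches(num_examples, batch_size):
    """Yield full batches of example indices for a single pass over the data."""
    for start in range(0, num_examples - batch_size + 1, batch_size):
        yield list(range(start, start + batch_size))


def run_loop(batches, steps):
    """Pull exactly `steps` batches, as one infeed loop would.

    Returns the number of batches consumed, or raises RuntimeError if the
    source runs dry first (a stand-in for OutOfRangeError).
    """
    iterator = iter(batches)
    consumed = 0
    for _ in range(steps):
        try:
            next(iterator)
        except StopIteration:
            raise RuntimeError(
                "End of sequence after %d of %d batches" % (consumed, steps))
        consumed += 1
    return consumed


# Single pass: 1000 examples at batch size 8 gives only 125 batches,
# fewer than the 600 the loop asks for, so the source is exhausted.
try:
    run_loop(make_batches(1000, 8), ITERATIONS_PER_LOOP)
except RuntimeError as err:
    print(err)

# Repeating the source indefinitely lets the loop always find a next batch.
repeated = itertools.cycle(make_batches(1000, 8))
print(run_loop(repeated, ITERATIONS_PER_LOOP))
```

If this reading applies, the analogous change in a tf.data input pipeline would be to call `dataset.repeat()` on the training dataset inside the input function, or to make sure the training files contain enough examples for the requested number of steps. It is also worth checking that the training TFRecord files are not empty or truncated, since that would produce the same symptom.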

