python - How to resolve "OutOfRangeError: End of sequence" when training XLNet (or BERT) on a TPU?
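Problem description
I get an OutOfRangeError when running the XLNet model on a TPU. I searched for similar questions and found suggestions that the error may be caused by a lack of compute resources such as memory. However, I upgraded my Google Cloud VM instance to a higher-performance tier and the error persists. How can I resolve this?
The error traceback is shown below: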
I0621 14:16:40.713411 140181362325248 tpu_estimator.py:463] Starting infeed thread controller.
I0621 14:16:40.713979 140181327095552 tpu_estimator.py:482] Starting outfeed thread controller.
I0621 14:16:42.156764 140182961767872 tpu_estimator.py:536] Enqueue next (600) batch(es) of data to infeed.
I0621 14:16:42.157500 140182961767872 tpu_estimator.py:540] Dequeue next (600) batch(es) of data from outfeed.
I0621 14:16:51.020121 140181362325248 error_handling.py:70] Error recorded from infeed: End of sequence
[[node input_pipeline_task0/while/IteratorGetNext (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
Caused by op u'input_pipeline_task0/while/IteratorGetNext', defined at:
File "run_classifier.py", line 903, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_classifier.py", line 767, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2251, in _call_model_fn
config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2547, in _model_fn
input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1167, in generate_infeed_enqueue_ops_and_dequeue_fn
self._invoke_input_fn_and_record_structure())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1271, in _invoke_input_fn_and_record_structure
wrap_fn(device=host_device, op_fn=enqueue_ops_fn))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2945, in _wrap_computation_in_while_loop
parallel_iterations=1)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3556, in while_loop
return_same_structure)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3087, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3022, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2934, in computation
with ops.control_dependencies(op_fn()):
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 859, in enqueue_ops_fn
features, labels = inputs.features_and_labels() # Calls get_next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 3127, in features_and_labels
return _Inputs._parse_inputs(self._iterator.get_next())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/data/ops/iterator_ops.py", line 414, in get_next
output_shapes=self._structure._flat_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1685, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
OutOfRangeError (see above for traceback): End of sequence
[[node input_pipeline_task0/while/IteratorGetNext (defined at /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1112) ]]
I0621 16:07:06.619868 139957928773056 monitored_session.py:1180] An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
Solution
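The log itself points to one thing worth checking: the infeed thread enqueues 600 batches per loop, and the iterator then reports "End of sequence", which is the error a tf.data iterator raises once its dataset is exhausted. A possible cause, offered here as an assumption rather than a confirmed diagnosis, is that the training input pipeline yields fewer batches than one loop requests, rather than a shortage of memory. The sketch below is plain Python arithmetic, not TensorFlow code. The function names and the example numbers (4000 examples, batch size 8) are hypothetical; only the figure 600 comes from the log above.

```python
def batches_available(num_examples, batch_size, num_epochs=1, drop_remainder=True):
    """Number of batches a finite (non-repeating) input pipeline can yield.

    Hypothetical helper for illustration; it only does the arithmetic.
    """
    total = num_examples * num_epochs
    if drop_remainder:
        # A trailing partial batch is discarded.
        return total // batch_size
    # Ceiling division without floats, so a partial batch counts as one.
    return -(-total // batch_size)


def infeed_will_run_dry(num_examples, batch_size, batches_per_loop, num_epochs=1):
    """True if one infeed loop asks for more batches than the data can supply."""
    return batches_available(num_examples, batch_size, num_epochs) < batches_per_loop


# Assumed example: 4000 training examples, batch size 8, and the
# 600 batches per loop reported in the log.
print(batches_available(4000, 8))          # batches from a single pass
print(infeed_will_run_dry(4000, 8, 600))   # does a single pass fall short?
```

If this check comes out true for the real dataset, two common remedies are to make the training dataset repeat (tf.data.Dataset.repeat() in the training input function) or to lower iterations_per_loop in the TPU run configuration. Whether either applies here depends on how run_classifier.py builds its input, which the question does not show.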