TensorFlow saver.save() not working consistently - causes core dump and "terminate called after throwing an instance of 'std::length_error'"

Problem Description

I'm training a model in TensorFlow that requires a large amount of memory-intensive input data, so I use a for loop to load new training data on each iteration. At the end of each iteration I save the model with model.save(), which is implemented as follows:

    def save(self, sess=None):
        # Save all of the model's variables to a checkpoint under ./tmp/
        if not sess:
            raise AttributeError("TensorFlow session not provided.")
        saver = tf.train.Saver(self.vars)
        save_path = saver.save(sess, "./tmp/%s.ckpt" % self.name)
        print("Model saved in file: %s" % save_path)

Then, in the next iteration of the loop, I reload the model by instantiating the model object and calling model.load(), shown below:

    def load(self, sess=None):
        # Rebuild the graph from the saved .meta file and restore the
        # most recent checkpoint in ./tmp/
        if not sess:
            raise AttributeError("TensorFlow session not provided.")
        saver = tf.train.import_meta_graph('./tmp/%s.ckpt.meta' % self.name)
        save_path = "./tmp/%s.ckpt" % self.name
        saver.restore(sess, tf.train.latest_checkpoint('./tmp/'))
        print("Model restored from file: %s" % save_path)

Usually the model saves and loads without any problem as I iterate over the training dataset.

However, on some iterations of the loop, I get the following error when saving the model:

terminate called after throwing an instance of 'std::length_error'
what():  basic_string::append
Fatal Python error: Aborted

Thread 0x00007fd3a5080700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 295 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/queue.py", line 164 in get
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fd55f4ed700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 295 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/queue.py", line 164 in get
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner  File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fd499374700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 295 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/queue.py", line 164 in get
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fd57cf17700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 295 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/queue.py", line 164 in get
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fd14affd700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 295 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/queue.py", line 164 in get
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fd14b7fe700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 295 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/queue.py", line 164 in get
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py", line 159 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fd62af20700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 299 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 551 in wait
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tqdm/_monitor.py", line 69 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 916 in _bootstrap_inner
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007fd63926c700 (most recent call first):
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3166 in _as_graph_def
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3238 in as_graph_def
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1246 in export_meta_graph
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py", line 1203 in save
File "/home/dxcl/graph_model/models.py", line 85 in save
File "/home/dxcl/graph_model/unsupervised_train.py", line 351 in train
File "/home/dxcl/graph_model/unsupervised_train.py", line 362 in main
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/absl/app.py", line 250 in _run_main
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/absl/app.py", line 299 in run
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
File "/home/dxcl/graph_model/unsupervised_train.py", line 365 in <module>
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/runpy.py", line 85 in _run_code
File "/home/dxcl/conda_envs/py_env_2/lib/python3.6/runpy.py", line 193 in _run_module_as_main
Aborted (core dumped)

I've noticed that this doesn't happen when I reduce the amount of training data fed in on each loop iteration, so it's probably a memory-related issue. I just don't understand why it works in some cases but not in others.
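
One thing I'm considering (an assumption on my part, based on the traceback dying inside as_graph_def() during export_meta_graph): the serialized graph might be growing from one iteration to the next, and protobuf serialization fails once a message approaches the 2 GB limit, which could explain why it only breaks on some iterations. A small diagnostic I could call just before saving to check this:

    import tensorflow as tf

    def log_graph_size(sess):
        # Serialize the current graph and report how large it has become.
        # GraphDef is a protobuf message, so ByteSize() gives the serialized size.
        graph_def = sess.graph.as_graph_def()
        print("Graph: %d nodes, %.1f MB serialized"
              % (len(graph_def.node), graph_def.ByteSize() / 1e6))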

(I'm running tensorflow-gpu 1.15 and executing this in a JupyterLab terminal. Happy to provide any other relevant details or code!)

Tags: python, multithreading, tensorflow, jupyter-lab, coredump

Solution

