Does distributed TensorFlow need a distributed file system to save checkpoints?

Problem description

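I am trying to run some example code based on this official TensorFlow tutorial.
I also watched this video, which is really good.
As the video mentions, the chief worker is responsible for saving checkpoints, and this is implemented by tf.train.MonitoredTrainingSession.
So I assumed that only the chief worker needs a directory for saving checkpoints.
When I run the code with ps0 on machine1 and worker0 on machine2, everything seems to work.
But when I run ps0 and worker0 on machine1 and ps1 and worker1 on machine2, an error occurs. The error in worker0's log is as follows: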

Traceback (most recent call last):
  File "distributed_train.py", line 136, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "distributed_train.py", line 97, in main
    hooks=hooks) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 712, in create_session
    hook.after_create_session(self.tf_sess, self.coord)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 450, in after_create_session
    self._save(session, global_step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/basic_session_run_hooks.py", line 481, in _save
    self._get_saver().save(session, self._save_path, global_step=step)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1669, in save
    raise exc
tensorflow.python.framework.errors_impl.NotFoundError: ./train_dir/dist_worker_0/model.ckpt-0_temp_cf2b45f059b74507a65cae9b7a9ea5b4; No such file or directory
     [[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](save/ShardedFilename_1, save/SaveV2_1/tensor_names, save/SaveV2_1/shape_and_slices, conv1/biases, conv1/biases/Adagrad, conv2/biases, conv2/biases/Adagrad, local3/biases, local3/biases/Adagrad, local4/biases, local4/biases/Adagrad, softmax_linear/biases, softmax_linear/biases/Adagrad)]]

Caused by op u'save/SaveV2_1', defined at:
  File "distributed_train.py", line 136, in <module>
    tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "distributed_train.py", line 97, in main
    hooks=hooks) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 415, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 826, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 549, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1012, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1017, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 706, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 468, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 212, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 856, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1284, in __init__
    self.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1296, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1333, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 772, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 363, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 337, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 278, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 194, in save_op
    tensors)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1687, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): ./train_dir/dist_worker_0/model.ckpt-0_temp_cf2b45f059b74507a65cae9b7a9ea5b4; No such file or directory
     [[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:ps/replica:0/task:1/device:CPU:0"](save/ShardedFilename_1, save/SaveV2_1/tensor_names, save/SaveV2_1/shape_and_slices, conv1/biases, conv1/biases/Adagrad, conv2/biases, conv2/biases/Adagrad, local3/biases, local3/biases/Adagrad, local4/biases, local4/biases/Adagrad, softmax_linear/biases, softmax_linear/biases/Adagrad)]]

However, the directory ./train_dir/dist_worker_0/model.ckpt-0_temp_cf2b45f059b74507a65cae9b7a9ea5b4 does exist (on machine1).
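The `_device` attribute on the failing node in the error above names the task that actually executed the save op. The sketch below reads that attribute out of an abbreviated copy of the error line and looks it up in a cluster layout matching the one described in the question. The `CLUSTER` dictionary, the host names, and the helper `host_running_op` are illustrative assumptions, not part of the original script.

```python
import re

# Hypothetical cluster layout matching the question: ps0 and worker0 on
# machine1, ps1 and worker1 on machine2. Host names are placeholders.
CLUSTER = {
    "ps": ["machine1", "machine2"],
    "worker": ["machine1", "machine2"],
}

# Matches the device string TensorFlow prints for a node, e.g.
# _device="/job:ps/replica:0/task:1/device:CPU:0"
DEVICE_RE = re.compile(
    r'_device="/job:(?P<job>\w+)/replica:\d+/task:(?P<task>\d+)/'
)


def host_running_op(error_text, cluster=CLUSTER):
    """Return (job, task_index, host) for the op named in an error message."""
    match = DEVICE_RE.search(error_text)
    if match is None:
        raise ValueError("no _device attribute found in error text")
    job = match.group("job")
    task = int(match.group("task"))
    return job, task, cluster[job][task]


# Abbreviated copy of the node line from the traceback above.
error_line = (
    '[[Node: save/SaveV2_1 = SaveV2[dtypes=[DT_FLOAT], '
    '_device="/job:ps/replica:0/task:1/device:CPU:0"]'
    '(save/ShardedFilename_1)]]'
)

print(host_running_op(error_line))  # ('ps', 1, 'machine2')
```

Under this layout the failing `save/SaveV2_1` op ran on ps1, which lives on machine2, so the relative path `./train_dir/dist_worker_0/...` would have been resolved on machine2 rather than on machine1, where the directory was checked.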

Part of the code (actually taken from the official tutorial):

        # The MonitoredTrainingSession takes care of session initialization,
        # restoring from a checkpoint, saving to a checkpoint, and closing when done
        # or an error occurs.
        with tf.train.MonitoredTrainingSession(
                master=server.target,
                config=config,
                is_chief=(FLAGS.task_index == 0),
                checkpoint_dir="./train_dir/dist_{0}_{1}".format(FLAGS.job_name,
                                                             FLAGS.task_index),
                hooks=hooks) as mon_sess:
            while not mon_sess.should_stop():
                # Run a training step asynchronously.
                # See tf.train.SyncReplicasOptimizer for additional details on how to
                # perform *synchronous* training.
                # mon_sess.run handles AbortedError in case of preempted PS.
                mon_sess.run(train_op)

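I searched Stack Overflow questions and GitHub issues, and the answers to similar questions suggest using HDFS.
Doesn't "the chief worker is responsible for saving checkpoints" mean that I only need a local directory on the machine where the chief worker runs? Have I misunderstood something? Do I really need to use HDFS or something similar?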

Tags: python, tensorflow

Solution
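No answer is recorded here, so what follows is a reading of the traceback rather than a confirmed fix. The default saver is created with `sharded=True`, and the failing `SaveV2_1` node is placed on `/job:ps/replica:0/task:1`. This suggests that each parameter server writes its own shard and resolves the chief's path string on its own machine. Assuming that is the cause, the checkpoint path needs to point at storage every parameter-server machine can reach, such as a shared mount or the HDFS that the linked answers suggest. A minimal sketch of building one such path follows; the mount point `/mnt/shared/train_dir`, the run name, and the helper `shared_checkpoint_dir` are hypothetical.

```python
import posixpath


def shared_checkpoint_dir(shared_root, run_name="dist_train"):
    """Build one checkpoint path that does not depend on the task or its cwd.

    shared_root is assumed to be either an absolute path on a mount that
    every machine sees at the same location, or a URI such as hdfs://...
    """
    is_uri = "://" in shared_root
    if not is_uri and not posixpath.isabs(shared_root):
        # A relative path is resolved against each process's own working
        # directory, which is what the per-task "./train_dir/..." path did.
        raise ValueError("shared_root must be absolute or a URI")
    return posixpath.join(shared_root, run_name)


# Hypothetical shared mount; replace with storage that exists on the cluster.
ckpt_dir = shared_checkpoint_dir("/mnt/shared/train_dir")
print(ckpt_dir)  # /mnt/shared/train_dir/dist_train
```

The resulting string would then be passed as `checkpoint_dir=` to `tf.train.MonitoredTrainingSession` in place of the per-task relative path. Whether a plain shared mount is sufficient or HDFS is required depends on the cluster and is not something the traceback settles.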

