Unable to save Keras checkpoints on CloudML (1.8): ImportError: `save_model` requires h5py

Problem description

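I have the following callbacks, which run after every epoch:

  1. Write TensorBoard logs.
  2. Save a model checkpoint.

However, after the first training epoch I get the traceback below. I assume it is related to the checkpoint callback.

Is this expected behavior?

All of the callbacks are created by create_callbacks() in my callbacks.py: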

import glob
import os

import tensorflow as tf
from tensorflow.python.lib.io import file_io


def create_callbacks(job_dir, logs_path):

    checkpoint_path = 'checkpoint.{epoch:04d}-{val_loss:.9f}.hdf5'

    if not job_dir.startswith("gs://"):  # then local
        # Join the path components so this also works when job_dir has no trailing slash.
        checkpoint_path = os.path.join(job_dir, 'checkpoints', checkpoint_path)

    checkpoint = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, monitor='val_loss', verbose=0, save_best_only=True,
                                 save_weights_only=False,
                                 mode='auto', period=1)

    tb = tf.keras.callbacks.TensorBoard(log_dir=logs_path, batch_size=None, histogram_freq=0, write_graph=False)

    # Continuous eval callback
    export = ContinuousExport(eval_frequency=1, job_dir=job_dir)

    return [checkpoint, tb, export]


class ContinuousExport(tf.keras.callbacks.Callback):
    """Continuous eval callback to evaluate the checkpoint once every so many epochs."""

    def __init__(self, eval_frequency, job_dir):
        super().__init__()
        self.eval_frequency = eval_frequency
        self.job_dir = job_dir

    def on_epoch_end(self, epoch, logs=None):
        print('Epoch number is {}'.format(epoch))
        print('Frequency is {}'.format(self.eval_frequency))
        if epoch > 0 and epoch % self.eval_frequency == 0:
            # Unhappy hack to work around h5py not being able to write to GCS.
            # Force snapshots and saves to local filesystem, then copy them over to GCS.
            model_path_glob = 'checkpoint.*'
            if not self.job_dir.startswith("gs://"):
                model_path_glob = os.path.join(self.job_dir, 'checkpoints', model_path_glob)
            checkpoints = sorted(glob.glob(model_path_glob), key=os.path.getmtime)
            print('Path is {}'.format(model_path_glob))
            print('Length of cp is {}'.format(len(checkpoints)))
            if len(checkpoints) > 0:
                print(checkpoints[-1])
                if self.job_dir.startswith("gs://"):
                    print('Copying the model to {}'.format(self.job_dir + '/checkpoints/'))
                    copy_file_to_gcs(self.job_dir + '/checkpoints/', checkpoints[-1])
                else:
                    print('Using local storage, not saving to GCS')
            else:
                # No checkpoint file matched the glob for this epoch.
                print('\nEvaluation epoch[{}] (no checkpoints found)'.format(epoch))


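def copy_file_to_gcs(job_dir, file_path):
    """Copy a local file into job_dir (a GCS path), keeping only its file name."""
    with file_io.FileIO(file_path, mode='rb') as input_f:
        # Write in binary mode; use the base name so local directories are not repeated in GCS.
        with file_io.FileIO(os.path.join(job_dir, os.path.basename(file_path)), mode='wb') as output_f:
            output_f.write(input_f.read())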

INFO 2018-10-08 12:17:30 +0100 master-replica-0 Module completed; cleaning up.
INFO 2018-10-08 12:17:30 +0100 master-replica-0 Clean up finished.
ERROR 2018-10-08 12:18:23 +0100 service The replica master 0 exited with a non-zero status of 1.
ERROR 2018-10-08 12:18:23 +0100 service Traceback (most recent call last):
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
ERROR 2018-10-08 12:18:23 +0100 service     "__main__", mod_spec)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
ERROR 2018-10-08 12:18:23 +0100 service     exec(code, run_globals)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 167, in <module>
ERROR 2018-10-08 12:18:23 +0100 service     train_model(train_file=train_file, test_file=test_file, job_dir=job_dir, **arguments)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 59, in train_model
ERROR 2018-10-08 12:18:23 +0100 service     model = fit_model(model, train_g, test_g, callbacks)
ERROR 2018-10-08 12:18:23 +0100 service   File "/root/.local/lib/python3.5/site-packages/trainer/model.py", line 124, in fit_model
ERROR 2018-10-08 12:18:23 +0100 service     model.fit_generator(**params)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training.py", line 1598, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     initial_epoch=initial_epoch)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/training_generator.py", line 231, in fit_generator
ERROR 2018-10-08 12:18:23 +0100 service     callbacks.on_epoch_end(epoch, epoch_logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 95, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     callback.on_epoch_end(epoch, logs)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/callbacks.py", line 468, in on_epoch_end
ERROR 2018-10-08 12:18:23 +0100 service     self.model.save(filepath, overwrite=True)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/network.py", line 1126, in save
ERROR 2018-10-08 12:18:23 +0100 service     save_model(self, filepath, overwrite, include_optimizer)
ERROR 2018-10-08 12:18:23 +0100 service   File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/keras/_impl/keras/engine/saving.py", line 75, in save_model
ERROR 2018-10-08 12:18:23 +0100 service     raise ImportError('`save_model` requires h5py.')
ERROR 2018-10-08 12:18:23 +0100 service ImportError: `save_model` requires h5py.

Tags: python, tensorflow, keras

Solution


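Yes, you need to install the h5py package.

Keras writes full-model checkpoints to HDF5 files (.h5 / .hdf5), which act as the container for the trained model, and it relies on the h5py package to write them. Without h5py installed, the model cannot be saved.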
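
Because the error only appears at the end of the first epoch, it can help to check for h5py before training starts. The following is a minimal sketch of such a check, not part of Keras or of the original code; the helper names h5py_available and require_h5py are hypothetical.

```python
import importlib.util


def h5py_available():
    """Return True if the h5py package can be found in this environment.

    Hypothetical helper: it only looks the package up, it does not import it.
    """
    return importlib.util.find_spec("h5py") is not None


def require_h5py():
    """Raise before training starts instead of after the first epoch.

    Hypothetical helper, meant to be called before building the callbacks.
    """
    if not h5py_available():
        raise ImportError(
            "h5py is not installed, so ModelCheckpoint cannot write .hdf5 files. "
            "Install it with `pip install h5py`."
        )
```

Calling require_h5py() at the top of the training entry point would make a missing dependency fail the job within seconds rather than after a full epoch.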

Pre-built h5py wheels can be installed from PyPI with pip:

$ pip install h5py
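
On CloudML the training code runs on remote workers, so installing h5py on your own machine does not change what is available there. One common approach is to declare the dependency in the setup.py of the training package, so that it is installed on the workers when the package is installed. The sketch below assumes the package is called trainer (as the traceback paths suggest); the version string is a placeholder.

```python
# setup.py at the root of the training package.
# Sketch only: the name and version are placeholders, not taken from the original project.
from setuptools import find_packages, setup

# Packages that pip should install alongside the trainer package on each worker.
REQUIRED_PACKAGES = ["h5py"]

setup(
    name="trainer",
    version="0.1",
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)
```

With the dependency declared this way, resubmitting the job should let the ModelCheckpoint callback save the .hdf5 file at the end of each epoch.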
