Keras ModelCheckpoint does not work correctly on TPU

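Problem description

I am training a model on a Kaggle TPU with the ModelCheckpoint callback. Unfortunately, once training finishes, the restored model has a higher validation loss than the last model, which contradicts saving the model with the lowest validation loss (save_best_only=True).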

import tensorflow as tf
from tensorflow import keras

# Connect to the TPU and build a distribution strategy for it.
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

with tpu_strategy.scope():
    ...  # code elided in the original post

    # Stop early, lower the learning rate on plateaus, and checkpoint the lowest val_loss.
    callback = [tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=12),
                tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.6, patience=5, verbose=1),
                tf.keras.callbacks.ModelCheckpoint(f"best{fold}.hdf5", monitor='val_loss', verbose=1, save_best_only=True)]

    model.fit(x=[train[BSSID_FEATS], train[RSSI_FEATS], train['site_id'], floors_train],
              y=[train.x, train.y], shuffle=True, use_multiprocessing=False,
              validation_data=([val[BSSID_FEATS], val[RSSI_FEATS], val['site_id'], floors_val],
                               [val.x, val.y]), batch_size=BATCH_SIZE, epochs=250,
              callbacks=callback, verbose=True)

    # Reload the checkpointed model and compare it with the final in-memory model.
    bestModel = keras.models.load_model(f"best{fold}.hdf5", custom_objects={'root_mean_squared_error': root_mean_squared_error})
    print(f"Best model's validation accuracy: {computeScore(bestModel, val, floors_val)}")
    print(f"Last model's validation accuracy: {computeScore(model, val, floors_val)}")

Running the training code produces the following output.

Best model's validation accuracy: 9.709675083992993
Last model's validation accuracy: 8.558022953960057

When run on a GPU, the result is correct. What should I do to make it work on a TPU?

Tags: tensorflow, machine-learning, keras, deep-learning

Solution
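The cause of the mismatch is not established in the question, so the following is only a possible workaround, not a confirmed fix. It keeps the best weights in memory instead of going through the HDF5 save/load path, which at least helps tell whether the file round trip is what goes wrong on the TPU. `BestWeightsTracker` and `make_keras_callback` are hypothetical names introduced for this sketch; `LambdaCallback`, `get_weights` and `set_weights` are standard Keras APIs.

```python
import copy


class BestWeightsTracker:
    """Remember the weights seen at the lowest monitored value (e.g. val_loss)."""

    def __init__(self):
        self.best_value = float("inf")
        self.best_epoch = None
        self.best_weights = None

    def update(self, epoch, value, get_weights):
        """Snapshot the weights if `value` improved; return True when it did."""
        # `not value < best` also rejects NaN, which never compares as smaller.
        if value is None or not value < self.best_value:
            return False
        self.best_value = value
        self.best_epoch = epoch
        # Deep-copy so later training steps cannot mutate the snapshot.
        self.best_weights = copy.deepcopy(get_weights())
        return True


def make_keras_callback(tracker, model, monitor="val_loss"):
    """Wrap the tracker in a Keras callback (TensorFlow is imported lazily)."""
    import tensorflow as tf

    return tf.keras.callbacks.LambdaCallback(
        on_epoch_end=lambda epoch, logs: tracker.update(
            epoch, (logs or {}).get(monitor), model.get_weights
        )
    )
```

Assuming the names from the question, the tracker would be created before training, `make_keras_callback(tracker, model)` appended to the `callback` list passed to `model.fit`, and `model.set_weights(tracker.best_weights)` called afterwards before scoring. `tf.keras.callbacks.EarlyStopping` also accepts `restore_best_weights=True`, which may be a simpler option to try first.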

