Is it normal for my test loss to reach into the millions?

Problem description

I am training my model over multiple iterations (training, saving, and training again), and in the second iteration my val_loss has reached into the millions for some reason. Is there something wrong with how I import the model?

This is how I saved the initial model after my first run:

model.save('/content/drive/My Drive/Colab Notebooks/path/to/save/locaiton',save_format='tf')

This is how I import and overwrite it:

# Imports used by this function; preprocess_input should come from the same
# tf.keras.applications model family that was used when the model was first built.
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def retrainmodel(model_path,tr_path,v_path):
  image_size = 224
  BATCH_SIZE_TRAINING = 10
  BATCH_SIZE_VALIDATION = 10
  BATCH_SIZE_TESTING = 1
  EARLY_STOP_PATIENCE = 6
  STEPS_PER_EPOCH_TRAINING = 10
  STEPS_PER_EPOCH_VALIDATION = 10
  NUM_EPOCHS = 20 

  model = tf.keras.models.load_model(model_path)

  data_generator = ImageDataGenerator(preprocessing_function=preprocess_input)


  train_generator = data_generator.flow_from_directory(tr_path,
        target_size=(image_size, image_size),
        batch_size=BATCH_SIZE_TRAINING,
        class_mode='categorical')
  
  validation_generator = data_generator.flow_from_directory(v_path,
        target_size=(image_size, image_size),
        batch_size=BATCH_SIZE_VALIDATION,
        class_mode='categorical') 
  
  cb_early_stopper = EarlyStopping(monitor = 'val_loss', patience = EARLY_STOP_PATIENCE)
  cb_checkpointer = ModelCheckpoint(filepath = 'path/to/checkpoint/folder', monitor = 'val_loss', save_best_only = True, mode = 'auto')

  fit_history = model.fit(
        train_generator,
        steps_per_epoch=STEPS_PER_EPOCH_TRAINING,
        epochs = NUM_EPOCHS,
        validation_data=validation_generator,
        validation_steps=STEPS_PER_EPOCH_VALIDATION,
        callbacks=[cb_checkpointer, cb_early_stopper]
  )

  model.save('/content/drive/My Drive/Colab Notebooks/path/to/save/locaiton',save_format='tf')
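
For context, the function would then be called with the path of the saved model plus the training and validation image directories; the directory names below are hypothetical, since the question does not show them:

retrainmodel('/content/drive/My Drive/Colab Notebooks/path/to/save/locaiton',  # model saved after the first run
             '/content/drive/My Drive/train_dir',   # hypothetical training directory
             '/content/drive/My Drive/val_dir')     # hypothetical validation directory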
This is my output after passing my directories to this function:

Found 1421 images belonging to 5 classes.
Found 305 images belonging to 5 classes.
Epoch 1/20
10/10 [==============================] - 233s 23s/step - loss: 2.3330 - acc: 0.7200 - val_loss: 4.6237 - val_acc: 0.4400
Epoch 2/20
10/10 [==============================] - 171s 17s/step - loss: 2.7988 - acc: 0.5900 - val_loss: 56996.6289 - val_acc: 0.6800
Epoch 3/20
10/10 [==============================] - 159s 16s/step - loss: 1.2776 - acc: 0.6800 - val_loss: 8396707.0000 - val_acc: 0.6500
Epoch 4/20
10/10 [==============================] - 144s 14s/step - loss: 1.4562 - acc: 0.6600 - val_loss: 2099639.7500 - val_acc: 0.7200
Epoch 5/20
10/10 [==============================] - 126s 13s/step - loss: 1.0970 - acc: 0.7033 - val_loss: 50811.5781 - val_acc: 0.7300
Epoch 6/20
10/10 [==============================] - 127s 13s/step - loss: 0.7326 - acc: 0.8000 - val_loss: 84781.5703 - val_acc: 0.7000
Epoch 7/20
10/10 [==============================] - 110s 11s/step - loss: 1.2356 - acc: 0.7100 - val_loss: 1000.2982 - val_acc: 0.7300

This is my optimizer:

sgd = optimizers.SGD(lr = 0.01, decay = 1e-6, momentum = 0.9, nesterov = True)
model.compile(optimizer = sgd, loss = 'categorical_crossentropy', metrics = ['acc'])

Where do you think I went wrong?

I am training my model in batches because I am working on Google Colab with a total of 22K images, so these results come after feeding the network 2,800 training images. Do you think it will sort itself out if I feed it more images, or is there a serious problem?

Tags: image-processing, deep-learning, neural-network, conv-neural-network, transfer-learning

Solution


Having a loss like this is not good, in my view. When we load a model and retrain it, it is logical to see a somewhat higher loss during the first few epochs. However, that loss should not shoot for the stars the way it does in your case. If the loss was around 0.5 at the time of saving, then when you load the same model for retraining it should not be more than about 10x the previous value, so something around 5 ± 1 would be expected. [Note: this is purely based on experience; there is no general way to know the loss in advance.]

If your loss is this high, the following explanations are plausible:

  1. A changing dataset: altering the dynamics of the training dataset between runs could push the model into this behaviour.

  2. Saving the model may have altered the weights (a quick check for this is sketched below).
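
A quick way to rule out cause 2 is to compare the weights before and after a save/load round trip. This is only a minimal sketch: it assumes `model` is still the in-memory model from before saving, and reuses the save path from the question.

import numpy as np
import tensorflow as tf

# Reload the copy that was written to disk and compare it, tensor by tensor,
# against the model that is still in memory from before the save.
reloaded = tf.keras.models.load_model(
    '/content/drive/My Drive/Colab Notebooks/path/to/save/locaiton')

mismatch = any(
    not np.allclose(w_before, w_after)
    for w_before, w_after in zip(model.get_weights(), reloaded.get_weights())
)
print('weights changed by save/load' if mismatch else 'weights are identical')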

Suggested solutions:

  1. Try using save_weights instead of the save method on the model:

     model.save_weights('path/to/filename.h5')
    

    Also, use load_weights instead of load_model:

     model = call_cnn_function_to_build_model()
     model.compile(... your args ...)
     model.load_weights('path/to/filename.h5')  # load_weights restores weights in place; don't reassign model to its return value
    
  2. Since you have checkpoints, try using the models that the checkpoint callback saved: instead of the final model, load a model from a checkpoint close to your last epochs (a minimal loading sketch follows).
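
     A minimal loading sketch, assuming the checkpoint path used in the question and the TensorFlow SavedModel format implied by a filepath without an .h5 extension:

      import tensorflow as tf

      # With save_best_only=True, this checkpoint holds the model with the lowest
      # val_loss seen during training, rather than the weights from the final epoch.
      best_model = tf.keras.models.load_model('path/to/checkpoint/folder')
      best_model.summary()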

PS: Corrections gratefully accepted.

