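tensorflow - Training and validation accuracy increase and training loss decreases, but validation loss is NaN
Problem description
I am training a classifier on the cats-vs-dogs data. The model is a minor variant of ResNet18 and returns softmax probabilities for the classes. However, I noticed that the validation loss is NaN in many epochs and extremely large in the others, while the training loss decreases steadily and behaves as expected. Training and validation accuracy both increase from epoch to epoch.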
Epoch 1/15
312/312 [==============================] - 1372s 4s/step - loss: 0.7849 - accuracy: 0.5131 - val_loss: nan - val_accuracy: 0.5343
Epoch 2/15
312/312 [==============================] - 1372s 4s/step - loss: 0.6966 - accuracy: 0.5539 - val_loss: 13989871201999266517090304.0000 - val_accuracy: 0.5619
Epoch 3/15
312/312 [==============================] - 1373s 4s/step - loss: 0.6570 - accuracy: 0.6077 - val_loss: 747123703808.0000 - val_accuracy: 0.5679
Epoch 4/15
312/312 [==============================] - 1372s 4s/step - loss: 0.6180 - accuracy: 0.6483 - val_loss: nan - val_accuracy: 0.6747
Epoch 5/15
312/312 [==============================] - 1373s 4s/step - loss: 0.5838 - accuracy: 0.6852 - val_loss: nan - val_accuracy: 0.6240
Epoch 6/15
312/312 [==============================] - 1372s 4s/step - loss: 0.5338 - accuracy: 0.7301 - val_loss: 31236203781405710523301888.0000 - val_accuracy: 0.7590
Epoch 7/15
312/312 [==============================] - 1373s 4s/step - loss: 0.4872 - accuracy: 0.7646 - val_loss: 52170.8672 - val_accuracy: 0.7378
Epoch 8/15
312/312 [==============================] - 1372s 4s/step - loss: 0.4385 - accuracy: 0.7928 - val_loss: 2130819335420217655296.0000 - val_accuracy: 0.8101
Epoch 9/15
312/312 [==============================] - 1373s 4s/step - loss: 0.3966 - accuracy: 0.8206 - val_loss: 116842888.0000 - val_accuracy: 0.7857
Epoch 10/15
312/312 [==============================] - 1372s 4s/step - loss: 0.3643 - accuracy: 0.8391 - val_loss: nan - val_accuracy: 0.8199
Epoch 11/15
312/312 [==============================] - 1373s 4s/step - loss: 0.3285 - accuracy: 0.8557 - val_loss: 788904.2500 - val_accuracy: 0.8438
Epoch 12/15
312/312 [==============================] - 1372s 4s/step - loss: 0.3029 - accuracy: 0.8670 - val_loss: nan - val_accuracy: 0.8245
Epoch 13/15
312/312 [==============================] - 1373s 4s/step - loss: 0.2857 - accuracy: 0.8781 - val_loss: 121907.8594 - val_accuracy: 0.8444
Epoch 14/15
312/312 [==============================] - 1373s 4s/step - loss: 0.2585 - accuracy: 0.8891 - val_loss: nan - val_accuracy: 0.8674
Epoch 15/15
312/312 [==============================] - 1374s 4s/step - loss: 0.2430 - accuracy: 0.8965 - val_loss: 822.7968 - val_accuracy: 0.8776
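I have checked the following:
- Infinity/NaN values in the validation data
- Infinity/NaN values introduced when normalizing the data (using tf.keras.applications.resnet.preprocess_input)
- Whether the model predicts only one class and thereby makes the loss function behave abnormally
Training code for reference: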
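The first two checks above can be sketched with NumPy as follows. This is a minimal illustration, not the code actually used in the question: `summarize_batch` is a hypothetical helper, and it assumes that a batch is available as an array of images plus one-hot labels (which is what "categorical_crossentropy" expects).

```python
import numpy as np


def summarize_batch(images, labels):
    """Count non-finite values and label rows that are not one-hot.

    Hypothetical helper: `images` is any array-like of pixel values and
    `labels` is an array-like of shape (batch, num_classes).
    """
    images = np.asarray(images, dtype=np.float64)
    labels = np.asarray(labels, dtype=np.float64)

    # A row is one-hot if every entry is 0 or 1 and the row sums to 1.
    only_zeros_and_ones = np.isin(labels, [0.0, 1.0]).all(axis=-1)
    sums_to_one = np.isclose(labels.sum(axis=-1), 1.0)

    return {
        "non_finite_pixels": int((~np.isfinite(images)).sum()),
        "non_finite_labels": int((~np.isfinite(labels)).sum()),
        "rows_not_one_hot": int((~(only_zeros_and_ones & sums_to_one)).sum()),
    }
```

Running it on batches taken after preprocessing, rather than on the raw arrays, would cover both the "raw data" and the "after normalization" checks in one pass.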
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-3)
model = Resnet18(NUM_CLASSES=NUM_CLASSES) # variant of original model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    train_dataset,
    steps_per_epoch=len(X_train) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=valid_dataset,
    validation_steps=len(X_valid) // BATCH_SIZE,
    verbose=1,
)
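Because `model.fit` only reports the validation loss averaged over the whole epoch, one way to narrow the problem down might be to evaluate the validation batches one at a time and record which ones produce a NaN, infinite, or implausibly large loss. The sketch below is library-agnostic; `find_suspect_batches`, `loss_fn`, and `limit` are hypothetical names introduced for illustration.

```python
import math


def find_suspect_batches(batches, loss_fn, limit=100.0, max_batches=None):
    """Return (index, loss) pairs for batches with a suspicious loss.

    `batches` is any iterable of (inputs, targets) pairs and `loss_fn`
    is any callable returning a scalar loss for one batch. A batch is
    flagged if its loss is NaN, infinite, or greater than `limit`.
    """
    suspects = []
    for index, (inputs, targets) in enumerate(batches):
        if max_batches is not None and index >= max_batches:
            break  # needed when the dataset repeats indefinitely
        loss = float(loss_fn(inputs, targets))
        if not math.isfinite(loss) or loss > limit:
            suspects.append((index, loss))
    return suspects
```

With the objects from the question, and assuming `valid_dataset` yields `(images, labels)` batches, `loss_fn` could for example wrap `model.test_on_batch(x, y, return_dict=True)["loss"]` (available in recent TF 2.x releases), with `max_batches` set to the same value as `validation_steps`.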
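The most relevant answer I found is the last paragraph of the accepted answer here. However, that does not seem to apply in this case, because the validation loss diverges from the training loss by orders of magnitude and also returns NaN. It looks like the loss function is misbehaving.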
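As an aside, rising accuracy together with an enormous mean loss is not contradictory from a purely numerical point of view: accuracy only looks at the arg-max, while cross-entropy grows with how confidently a sample is misclassified. The toy sketch below uses made-up logits and pure Python to show this. It is only an illustration of the arithmetic and says nothing about what this particular model or Keras does internally.

```python
import math


def cross_entropy_from_logits(logits, true_index):
    """Cross-entropy of one sample via a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[true_index]


def accuracy_and_mean_loss(samples):
    """`samples` is a list of (logits, true_index) pairs."""
    correct = 0
    total_loss = 0.0
    for logits, true_index in samples:
        predicted = max(range(len(logits)), key=logits.__getitem__)
        correct += int(predicted == true_index)
        total_loss += cross_entropy_from_logits(logits, true_index)
    return correct / len(samples), total_loss / len(samples)


# Made-up data: nine mildly confident correct predictions and one
# extremely confident wrong prediction.
samples = [([5.0, 0.0], 0)] * 9 + [([1e6, 0.0], 1)]
accuracy, mean_loss = accuracy_and_mean_loss(samples)
print(accuracy, mean_loss)
```

Here accuracy is 0.9 while the mean loss is on the order of 1e5, because the single wrong sample contributes a loss of about 1e6. If one of the logits is infinite, the same computation yields NaN.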
Solution