Loss becomes NaN after a certain step during model training

Problem description

I am working on TensorFlow object detection in Google Colab. During training, the loss becomes NaN after a certain step. How can I fix this?

Model = ssd_efficientdet_d2

Output =

I1125 09:30:20.814607 139701278168960 model_lib_v2.py:652] Step 1400 per-step time 0.418s loss=1.650
INFO:tensorflow:Step 1500 per-step time 0.601s loss=1.285

I1125 09:31:09.918310 139701278168960 model_lib_v2.py:652] Step 1500 per-step time 0.601s loss=1.285
INFO:tensorflow:Step 1600 per-step time 0.444s loss=1.344

I1125 09:31:59.594189 139701278168960 model_lib_v2.py:652] Step 1600 per-step time 0.444s loss=1.344
INFO:tensorflow:Step 1700 per-step time 0.511s loss=nan

I1125 09:32:49.015780 139701278168960 model_lib_v2.py:652] Step 1700 per-step time 0.511s loss=nan
INFO:tensorflow:Step 1800 per-step time 0.576s loss=nan

I1125 09:33:39.257319 139701278168960 model_lib_v2.py:652] Step 1800 per-step time 0.576s loss=nan
INFO:tensorflow:Step 1900 per-step time 0.439s loss=nan

I1125 09:34:27.547188 139701278168960 model_lib_v2.py:652] Step 1900 per-step time 0.439s loss=nan
INFO:tensorflow:Step 2000 per-step time 0.445s loss=nan

I1125 09:35:17.008013 139701278168960 model_lib_v2.py:652] Step 2000 per-step time 0.445s loss=nan
INFO:tensorflow:Step 2100 per-step time 0.490s loss=nan

I1125 09:36:08.541600 139701278168960 model_lib_v2.py:652] Step 2100 per-step time 0.490s loss=nan
INFO:tensorflow:Step 2200 per-step time 0.697s loss=nan

Tags: tensorflow, machine-learning, deep-learning, object-detection

Solution


I have seen many things cause a model to diverge, which can show up as the loss increasing or the accuracy dropping.

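  1. The learning rate may be too high, so start by lowering it.
  2. Check the classifier. If you are using one such as DNNClassifier, make sure it is the right classifier for your task.
  3. Check that the labels are correct and fall within the domain of the loss function.
  4. Check the loss function itself. Sometimes it is the cause, because the input data does not match what the loss function expects.
  5. Make sure the data is normalized properly. You probably want pixels in the range [-1, 1] rather than [0, 255].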
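
As a rough illustration of points 3 and 5, here is a minimal sketch in plain Python. The annotation layout (a dict with a 1-based integer "label" and a "box" of normalized (ymin, xmin, ymax, xmax)) and the helper names find_bad_annotations and scale_pixels are assumptions made for this example. They are not taken from the question or from the TensorFlow Object Detection API, so adapt them to however your dataset is actually stored. It is also worth checking whether your pipeline already rescales pixels before you add your own step.

```python
import math


def find_bad_annotations(annotations, num_classes):
    """Return (index, reason) pairs for annotations likely to break the loss.

    Assumes each annotation is a dict with a 1-based integer "label" and a
    "box" of (ymin, xmin, ymax, xmax) normalized to [0, 1]. This layout is
    a hypothetical one chosen for illustration.
    """
    problems = []
    for i, ann in enumerate(annotations):
        label = ann["label"]
        ymin, xmin, ymax, xmax = ann["box"]
        coords = (ymin, xmin, ymax, xmax)
        # Report only the first problem found for each annotation.
        if not 1 <= label <= num_classes:
            problems.append((i, "label out of range"))
        elif not all(math.isfinite(c) for c in coords):
            problems.append((i, "non-finite coordinate"))
        elif not all(0.0 <= c <= 1.0 for c in coords):
            problems.append((i, "coordinate outside [0, 1]"))
        elif ymin >= ymax or xmin >= xmax:
            problems.append((i, "empty or inverted box"))
    return problems


def scale_pixels(values):
    """Map pixel values from [0, 255] to [-1, 1]."""
    return [v / 127.5 - 1.0 for v in values]


if __name__ == "__main__":
    sample = [
        {"label": 1, "box": (0.1, 0.1, 0.5, 0.5)},           # fine
        {"label": 0, "box": (0.1, 0.1, 0.5, 0.5)},           # label 0 is not a valid 1-based class
        {"label": 2, "box": (0.1, 0.1, float("nan"), 0.5)},  # NaN coordinate
        {"label": 1, "box": (0.6, 0.1, 0.5, 0.5)},           # ymin > ymax
    ]
    for index, reason in find_bad_annotations(sample, num_classes=2):
        print(index, reason)
    print(scale_pixels([0, 127.5, 255]))
```

Running a check like this over the whole dataset before training is cheap, and a single bad record is sometimes enough to turn the loss into NaN partway through a run.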
