TensorFlow - High loss after filling up the shuffle buffer

Problem description

I am using the Dataset API in TensorFlow. While training a CNN, I found that every time the dataset refills its shuffle buffer, the loss jumps back up sharply, to roughly its value at initialization. It does, however, converge faster than at the start. (For example, going from a loss of 10 down to 8 took steps 1 through 500, but after the buffer refills the loss climbs back to about 10 and then drops to 8 again within 100 steps or fewer.)
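For context, `tf.data`'s `shuffle(buffer_size)` keeps only `buffer_size` elements in memory: it first fills the buffer (the "Filling up shuffle buffer" log line), then on each draw emits one random buffered element and replaces it with the next input element. A minimal pure-Python simulation of that mechanism (illustrative only, not TensorFlow's actual implementation):

```python
import random

def shuffled(stream, buffer_size, rng=random.Random(0)):
    """Simulate tf.data's bounded shuffle buffer over an input stream."""
    it = iter(stream)
    buf = []
    # Fill phase: corresponds to the "Filling up shuffle buffer" log line.
    for item in it:
        buf.append(item)
        if len(buf) == buffer_size:
            break
    # Steady state: emit a random buffered element, replace it with the next input.
    for item in it:
        i = rng.randrange(len(buf))
        yield buf[i]
        buf[i] = item
    # Input exhausted: drain whatever is left in the buffer.
    rng.shuffle(buf)
    yield from buf

out = list(shuffled(range(10), buffer_size=4))
print(sorted(out))  # every element appears exactly once, just reordered
```

Note that an element can only move "forward" by at most `buffer_size` positions, so the randomization is local, not a full permutation of the dataset.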

Training log (note the very high loss at step 960):

2018-07-15 18:06:37.125745: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 9594 of 10240
2018-07-15 18:06:37.852214: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:step: 0 loss: 10.590 acc: 0.000 time: 22.603
INFO:tensorflow:step: 1 loss: 9.966 acc: 0.039 time: 2.598
INFO:tensorflow:step: 2 loss: 8.744 acc: 0.033 time: 2.612
INFO:tensorflow:step: 3 loss: 10.865 acc: 0.020 time: 2.583
INFO:tensorflow:step: 4 loss: 7.930 acc: 0.047 time: 2.602
INFO:tensorflow:step: 5 loss: 8.070 acc: 0.027 time: 2.553
INFO:tensorflow:step: 6 loss: 8.437 acc: 0.031 time: 2.588
INFO:tensorflow:step: 7 loss: 8.677 acc: 0.039 time: 2.582
INFO:tensorflow:step: 8 loss: 8.571 acc: 0.021 time: 2.594
INFO:tensorflow:step: 9 loss: 8.581 acc: 0.033 time: 2.582
INFO:tensorflow:step: 10 loss: 8.333 acc: 0.006 time: 2.581
************************** some steps omitted **************************
INFO:tensorflow:step: 954 loss: 8.333 acc: 0.008 time: 1.982
INFO:tensorflow:step: 955 loss: 8.385 acc: 0.006 time: 1.982
INFO:tensorflow:step: 956 loss: 8.297 acc: 0.006 time: 1.962
INFO:tensorflow:step: 957 loss: 8.173 acc: 0.004 time: 1.961
INFO:tensorflow:step: 958 loss: 8.155 acc: 0.004 time: 1.987
INFO:tensorflow:step: 959 loss: 8.189 acc: 0.018 time: 1.993
2018-07-15 18:46:42.820175: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 9907 of 10240
2018-07-15 18:46:43.191982: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
INFO:tensorflow:step: 960 loss: 12.504 acc: 0.000 time: 12.849
INFO:tensorflow:step: 961 loss: 11.177 acc: 0.000 time: 2.369
INFO:tensorflow:step: 962 loss: 10.289 acc: 0.000 time: 2.445
INFO:tensorflow:step: 963 loss: 9.984 acc: 0.000 time: 2.425
INFO:tensorflow:step: 964 loss: 9.824 acc: 0.000 time: 2.458

The code that builds the dataset:

# read data list from disk
face_classes = sorted(os.listdir(data_dir))
num_class = len(face_classes)
class2index = {k: v for v, k in enumerate(face_classes)}
data_list = map(lambda cls: list_images(data_dir, cls, class2index), face_classes)
# flatten the list of per-class lists into one list of (path, label) pairs
data_list = [item for sublist in data_list for item in sublist]

# create a dataset
dataset = tf.data.Dataset.from_tensor_slices(
                (list(map(lambda item: item[0], data_list)),
                 list(map(lambda item: item[1], data_list))))

dataset = dataset.prefetch(batch_size * 100)
dataset = dataset.map(decode_data) # decode file name to image
dataset = dataset.shuffle(50 * batch_size).repeat(epoch_num).batch(batch_size)

return dataset.make_one_shot_iterator().get_next()

At first I thought this was the same bug as https://stackoverflow.com/a/43670684/5634636, but the problem persists after adding sorted() around os.listdir(data_dir). The behavior looks very similar to that question, but I can't be sure, and I don't know how to fix it. What exactly happens when the shuffle buffer is being filled?
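One hypothesis worth checking: the file list above is built class by class, and the buffer (10240) is much smaller than the dataset, so right after a (re)fill every draw comes only from the first `buffer_size` examples, i.e. from the first class or two. A model that has drifted toward the later classes would then briefly show a large loss. A quick pure-Python check of that bias (illustrative sketch; the sizes and helper are hypothetical, not TensorFlow code):

```python
import random

def shuffle_buffer(stream, buffer_size, rng):
    """Bounded shuffle buffer, as sketched from tf.data's logged behavior."""
    it = iter(stream)
    buf = [next(it) for _ in range(buffer_size)]  # fill phase
    for item in it:
        i = rng.randrange(buffer_size)
        yield buf[i]          # emit a random buffered element
        buf[i] = item         # replace it with the next input element
    rng.shuffle(buf)
    yield from buf            # drain the remainder

# 10 classes x 1000 examples each, listed in class order (as from a sorted dir),
# with a buffer far smaller than the dataset.
labels = [c for c in range(10) for _ in range(1000)]
draws = list(shuffle_buffer(labels, buffer_size=500, rng=random.Random(0)))
print(set(draws[:256]))  # {0} -- the first draws all come from class 0
```

If this is the cause, the fix would be to shuffle across the whole dataset (e.g. pre-shuffle the file list, or use a buffer at least as large as the dataset) rather than relying on a small buffer over class-ordered data.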

Tags: python, tensorflow, tensorflow-datasets

Solution
