I am new to TensorFlow and deep learning, and I am struggling with the Dataset class. I have tried a lot of things and I can't find a good solution.

What I am trying

I have a large number of images (500k+) to train my DNN with. This is a denoising autoencoder, so each training example is a pair of images (input and target). I am using the TF Dataset class to manage the data, but I think I am using it badly.

Here is how I load the filenames in a dataset:

import tensorflow as tf

class Data:
    def __init__(self, in_path, out_path):
        self.nb_images = 512
        self.test_ratio = 0.2
        self.batch_size = 8

        # load filenames in input and outputs
        inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)

        self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
        self.size_test = int(self.nb_images * self.test_ratio)

        # split arrays in training / validation
        test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
        test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)

        # transform array to tf.data.Dataset
        self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
        self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))

I have a function that is called at each epoch to prepare the dataset. It shuffles the filenames, decodes the files into images, and batches the data.

def get_batched_data(self, seed, batch_size):
    nb_batch = int(self.size_training / batch_size)

    def img_to_tensor(path_in, path_out):
        img_string_in = tf.read_file(path_in)
        img_string_out = tf.read_file(path_out)
        im_in = tf.image.decode_jpeg(img_string_in, channels=1)
        im_out = tf.image.decode_jpeg(img_string_out, channels=1)
        return im_in, im_out

    t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
    t_datas = t_datas.map(img_to_tensor)
    t_datas = t_datas.batch(batch_size)
    return t_datas

Now during the training, at each epoch we call the get_batched_data function, make an iterator, and run it for each batch, then feed the array to the optimizer operation.

for epoch in range(nb_epoch):
    sess_iter_in = tf.Session()
    sess_iter_out = tf.Session()

    batched_train = data.get_batched_data(epoch, batch_size)
    iterator_train = batched_train.make_one_shot_iterator()
    in_data, out_data = iterator_train.get_next()

    total_batch = int(data.size_training / batch_size)
    for batch in range(total_batch):
        print(f"{batch + 1} / {total_batch}")
        in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
        out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
        sess.run(optimizer, feed_dict={inputs: in_images,
                                       outputs: out_images})

What do I need?

I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.

Questions and problems

First question: am I using the Dataset class properly? I have seen very different approaches on the internet. For example, in this blog post the dataset is built from a placeholder and fed with the data during training. That seems strange to me because the data are all in an array, and therefore loaded in memory. I don't see the point of using tf.data.Dataset in that case.

I found a solution that uses repeat(epoch) on the dataset, like this, but then the shuffle will not be different for each epoch.

The second problem with my implementation is that I get an OutOfRangeError in some cases. With a small amount of data (512, as in the example) it works fine, but with more data the error occurs. I thought it was caused by a miscalculated number of batches due to bad rounding, or by the last batch being smaller than the others, but it happens at batch 32 out of 115... Is there any way to know the number of batches created after a batch(n) call on a dataset?

Sorry for this loooonng question, but I've been struggling with this for a few days.

As far as I know, the Official Performance Guideline is the best material for learning how to build input pipelines.


With shuffle() followed by repeat(), you get a different shuffle pattern for each epoch. You can confirm this with the following code:

dataset = tf.data.Dataset.from_tensor_slices([1,2,3,4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)

iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(x))

You can also use tf.contrib.data.shuffle_and_repeat, as mentioned on the official page above.

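Apart from the data pipeline itself, there are some other problems in your code. You are mixing up graph construction with graph execution. You create the data input pipeline again on every epoch, so you end up with as many redundant input pipelines as there are epochs. You can observe the redundant pipelines in TensorBoard.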
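
The growth can also be seen without TensorBoard by counting the operations in the graph. The snippet below is only a sketch: it assumes a TensorFlow version that ships tf.compat.v1 (1.13 or later), and build_pipeline is a hypothetical stand-in for get_batched_data plus the iterator creation.

```python
import tensorflow as tf


def build_pipeline():
    # Hypothetical stand-in for get_batched_data() + iterator creation.
    ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4]).shuffle(4).batch(2)
    return tf.compat.v1.data.make_one_shot_iterator(ds).get_next()


op_counts = []
with tf.Graph().as_default() as graph:
    for epoch in range(3):
        # Graph construction inside the epoch loop: the pattern to avoid.
        build_pipeline()
        op_counts.append(len(graph.get_operations()))

# The graph gets bigger after every "epoch".
print(op_counts)
```

Each pass through the loop adds a fresh copy of the dataset and iterator ops, and the earlier copies stay in the graph.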


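# Build the pipeline and the iterator once, outside the epoch loop.
# seed and batch_size are chosen once here.
batched_train = data.get_batched_data(seed, batch_size)
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

for epoch in range(nb_epoch):
    # reset iterator's state
    sess.run(iterator.initializer)

    try:
        while True:
            # fetch both tensors in one run so input and target stay paired
            in_images, out_images = sess.run([in_data, out_data])
            sess.run(optimizer, feed_dict={inputs: in_images.reshape((-1, 64, 64, 1)),
                                           outputs: out_images.reshape((-1, 64, 64, 1))})
    except tf.errors.OutOfRangeError:
        pass  # end of this epoch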
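
The loop above stops on OutOfRangeError, so it does not need the number of batches at all. If you still want that number, it follows from the dataset size and the batch size, because batch(n) keeps a smaller final batch unless drop_remainder=True is passed. A minimal sketch (num_batches is a hypothetical helper, not a TensorFlow function):

```python
import math


def num_batches(num_examples, batch_size, drop_remainder=False):
    """Number of batches produced by dataset.batch(batch_size) over num_examples items."""
    if drop_remainder:
        # A trailing partial batch is discarded.
        return num_examples // batch_size
    # A trailing partial batch is kept as one extra, smaller batch.
    return math.ceil(num_examples / batch_size)


# 512 images with a 0.2 test ratio leave 410 training images, as in the question.
print(num_batches(410, 8))
print(num_batches(410, 8, drop_remainder=True))
```

Note that int(size_training / batch_size), as used in the question, matches the drop_remainder=True count, so it is one less than the number of batches actually produced whenever the sizes do not divide evenly.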

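In addition, there are a few minor inefficiencies. You load the list of file paths with from_tensor_slices(), so the whole list is embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for details.)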
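
The alternative described in that guide section is to define the dataset in terms of placeholders and feed the list when the iterator is initialized. Here is a rough sketch, assuming TensorFlow 1.13 or later (for tf.compat.v1); the file names are made up and are never opened.

```python
import tensorflow as tf

tf1 = tf.compat.v1

# Hypothetical file lists; in the question these come from _load_data_pair_paths.
paths_in = ["in_%d.jpg" % i for i in range(5)]
paths_out = ["out_%d.jpg" % i for i in range(5)]

with tf.Graph().as_default():
    # The graph only holds placeholders, not the path lists themselves.
    in_ph = tf1.placeholder(tf.string, shape=[None])
    out_ph = tf1.placeholder(tf.string, shape=[None])
    ds = tf.data.Dataset.from_tensor_slices((in_ph, out_ph)).batch(2)
    iterator = tf1.data.make_initializable_iterator(ds)
    next_in, next_out = iterator.get_next()

    batches = []
    with tf1.Session() as sess:
        # The lists are fed once, when the iterator is (re)initialized.
        sess.run(iterator.initializer,
                 feed_dict={in_ph: paths_in, out_ph: paths_out})
        try:
            while True:
                batches.append(sess.run([next_in, next_out]))
        except tf.errors.OutOfRangeError:
            pass  # all paths consumed
```

In a real pipeline the map(img_to_tensor) step from the question would go between from_tensor_slices and batch.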

You should also use prefetch, and reduce the number of sess.run calls by combining your graph.
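
As an illustration of both points, the sketch below adds prefetch to the pipeline and connects the iterator output directly to the rest of the graph, so each training step is a single sess.run with no feed_dict. It assumes TensorFlow 1.13 or later (for tf.compat.v1). The small NumPy arrays stand in for decoded images, and the squared-error loss stands in for your model and optimizer; neither comes from the question.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1

# Stand-in data: 8 "input" values and their "targets" (not real images).
x = np.arange(8, dtype=np.float32).reshape(8, 1)
y = x + 100.0

with tf.Graph().as_default():
    ds = tf.data.Dataset.from_tensor_slices((x, y))
    ds = ds.shuffle(buffer_size=8)  # shuffled again on every repeat
    ds = ds.repeat(2)               # 2 epochs
    ds = ds.batch(4)
    ds = ds.prefetch(1)             # prepare the next batch while the current one is used
    in_data, out_data = tf1.data.make_one_shot_iterator(ds).get_next()

    # The model consumes the iterator tensors directly, without placeholders.
    loss = tf.reduce_mean(tf.square(out_data - in_data))

    losses = []
    with tf1.Session() as sess:
        try:
            while True:
                losses.append(sess.run(loss))  # one run per training step
        except tf.errors.OutOfRangeError:
            pass  # both epochs consumed
```

With this layout the image data never leaves the TensorFlow runtime between the input pipeline and the model, which is what removes the extra sess.run calls.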
