python - How to properly use a TensorFlow Dataset with batching?
Question
I am new to Tensorflow and deep learning, and I am struggling with the Dataset class. I tried a lot of things and I can’t find a good solution.
What I am trying
I have a large number of images (500k+) to train my DNN with. This is a denoising autoencoder, so I have a pair for each image. I am using TF's Dataset class to manage the data, but I think I am using it really badly.
Here is how I load the filenames in a dataset:
class Data:
    def __init__(self, in_path, out_path):
        self.nb_images = 512
        self.test_ratio = 0.2
        self.batch_size = 8

        # load filenames for inputs and outputs
        inputs, outputs, self.nb_images = self._load_data_pair_paths(in_path, out_path, self.nb_images)

        self.size_training = self.nb_images - int(self.nb_images * self.test_ratio)
        self.size_test = int(self.nb_images * self.test_ratio)

        # split arrays into training / validation
        test_data_in, training_data_in = self._split_test_data(inputs, self.test_ratio)
        test_data_out, training_data_out = self._split_test_data(outputs, self.test_ratio)

        # turn the arrays into tf.data.Dataset objects
        self.train_dataset = tf.data.Dataset.from_tensor_slices((training_data_in, training_data_out))
        self.test_dataset = tf.data.Dataset.from_tensor_slices((test_data_in, test_data_out))
I have a function, called at each epoch, that prepares the dataset: it shuffles the filenames, transforms the filenames into images, and batches the data.
def get_batched_data(self, seed, batch_size):
    nb_batch = int(self.size_training / batch_size)

    def img_to_tensor(path_in, path_out):
        img_string_in = tf.read_file(path_in)
        img_string_out = tf.read_file(path_out)
        im_in = tf.image.decode_jpeg(img_string_in, channels=1)
        im_out = tf.image.decode_jpeg(img_string_out, channels=1)
        return im_in, im_out

    t_datas = self.train_dataset.shuffle(self.size_training, seed=seed)
    t_datas = t_datas.map(img_to_tensor)
    t_datas = t_datas.batch(batch_size)
    return t_datas
Now during training, at each epoch we call the get_batched_data function, make an iterator, run it for each batch, and feed the resulting arrays to the optimizer operation.
for epoch in range(nb_epoch):
    sess_iter_in = tf.Session()
    sess_iter_out = tf.Session()

    batched_train = data.get_batched_data(epoch)
    iterator_train = batched_train.make_one_shot_iterator()
    in_data, out_data = iterator_train.get_next()

    total_batch = int(data.size_training / batch_size)
    for batch in range(total_batch):
        print(f"{batch + 1} / {total_batch}")
        in_images = sess_iter_in.run(in_data).reshape((-1, 64, 64, 1))
        out_images = sess_iter_out.run(out_data).reshape((-1, 64, 64, 1))
        sess.run(optimizer, feed_dict={inputs: in_images,
                                       outputs: out_images})
What do I need?
I need to have a pipeline that loads only the images of the current batch (otherwise it will not fit in memory) and I want to shuffle the dataset in a different way for each epoch.
Questions and problems
First question: am I using the Dataset class the right way? I have seen very different approaches on the internet; for example, in this blog post the dataset is used with a placeholder and fed with the data during training. That seems strange, because the data are all in an array and therefore already loaded in memory, so I don't see the point of using tf.data.Dataset in that case.
I found a solution using repeat(epoch) on the dataset, like this, but then the shuffle is not different for each epoch.
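The seeding behaviour behind this can be illustrated without TF at all. The sketch below (plain Python's random module, not tf.data, so only an analogy) shows why a shuffle driven by one fixed seed replays the identical order every pass, while a per-epoch seed such as seed=epoch produces a different order each time:

```python
import random

def shuffled(items, seed):
    # Deterministic shuffle of a copy, the way a fixed `seed=` behaves in tf.data.
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

data = list(range(6))

# Same seed every epoch -> the exact same order is replayed each pass.
assert shuffled(data, seed=0) == shuffled(data, seed=0)

# A per-epoch seed (e.g. seed=epoch) changes the order between passes.
orders = [shuffled(data, seed=epoch) for epoch in range(3)]
assert len({tuple(o) for o in orders}) > 1
```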
The second problem with my implementation is that I get an OutOfRangeError in some cases. With a small amount of data (512, as in the example) it works fine, but with a larger amount the error occurs. I thought it was caused by a wrong batch count due to bad rounding, or by the last batch holding fewer samples, but it happens at batch 32 out of 115... Is there any way to know the number of batches created after a batch(n) call on a dataset?
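For reference, Dataset.batch(n) keeps a smaller final batch by default, so the batch count is ceil(size / n); recent TF versions also accept drop_remainder=True, which discards the partial batch and gives floor(size / n). A quick pure-Python check of the two formulas (nothing here depends on TF):

```python
import math

def nb_batches(dataset_size, batch_size, drop_remainder=False):
    # batch(n) yields ceil(size/n) batches by default (the last one smaller);
    # with drop_remainder=True it yields floor(size/n) full batches.
    if drop_remainder:
        return dataset_size // batch_size
    return math.ceil(dataset_size / batch_size)

assert nb_batches(512, 8) == 64                        # divides evenly
assert nb_batches(515, 8) == 65                        # 64 full batches + 1 of size 3
assert nb_batches(515, 8, drop_remainder=True) == 64   # partial batch dropped
```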
Sorry for this loooonng question, but I've been struggling with this for a few days.
Solution
As far as I know, the Official Performance Guideline is the best teaching material for building input pipelines.

"I want to shuffle the dataset in a different way for each epoch."

With shuffle() and repeat(), you get a different shuffle pattern for each epoch. You can confirm it with the following code:
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])
dataset = dataset.shuffle(4)
dataset = dataset.repeat(3)
iterator = dataset.make_one_shot_iterator()
x = iterator.get_next()

with tf.Session() as sess:
    for i in range(10):
        print(sess.run(x))
You can also use tf.contrib.data.shuffle_and_repeat, mentioned on the official page above.

Beyond building the data pipeline, there are some other problems in your code: you are confusing graph construction with graph execution. You create the data input pipeline repeatedly, so you end up with as many redundant input pipelines as there are epochs. You can observe the redundant pipelines in TensorBoard.

You should move the graph-construction code outside the loop, like the following (pseudo) code:
batched_train = data.get_batched_data()
iterator = batched_train.make_initializable_iterator()
in_data, out_data = iterator.get_next()

for epoch in range(nb_epoch):
    # reset the iterator's state
    sess.run(iterator.initializer)
    try:
        while True:
            in_images = sess.run(in_data).reshape((-1, 64, 64, 1))
            out_images = sess.run(out_data).reshape((-1, 64, 64, 1))
            sess.run(optimizer, feed_dict={inputs: in_images,
                                           outputs: out_images})
    except tf.errors.OutOfRangeError:
        pass
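The reinitialize-and-drain pattern above has a direct plain-Python analogue, which may make the control flow easier to see: a fresh iterator is created per epoch (like running iterator.initializer) and the inner loop runs until the iterator is exhausted (StopIteration playing the role of tf.errors.OutOfRangeError). This is only a toy sketch with a list standing in for the dataset:

```python
def make_iterator(data, batch_size):
    # Analogue of sess.run(iterator.initializer): a fresh iterator per epoch.
    return iter([data[i:i + batch_size] for i in range(0, len(data), batch_size)])

data = list(range(10))
seen = []

for epoch in range(2):
    it = make_iterator(data, batch_size=4)   # re-initialize each epoch
    try:
        while True:
            batch = next(it)                 # like sess.run(get_next())
            seen.append(batch)
    except StopIteration:                    # plays the role of OutOfRangeError
        pass

# Two epochs x 3 batches each (sizes 4, 4, 2).
assert len(seen) == 6
assert [len(b) for b in seen[:3]] == [4, 4, 2]
```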
There is also a minor inefficiency: you load the list of file paths with from_tensor_slices(), so the whole list gets embedded in your graph. (See https://www.tensorflow.org/guide/datasets#consuming_numpy_arrays for details.)

You would also be better off using prefetch, and reducing the number of sess.run calls by combining ops into a single run.