Reading Dataset from files where some might be missing

Problem Description

I'm trying to load files into a TensorFlow Dataset, where some files might be missing (in which case I want to replace them with zeroes).

The directory structure I'm trying to read data from is as follows:

   |-data
   |---sensor_A
   |-----1.dat
   |-----2.dat
   |-----3.dat
   |---sensor_B
   |-----1.dat
   |-----2.dat
   |-----3.dat

The .dat files are CSV files with a space as the separator. Each file contains a single, multi-row observation: the number of columns is constant (say, 4) while the number of rows is unknown (time-series data).
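
For illustration, a hypothetical 1.dat with four columns and three rows might look like this (the values are made up):

   0.1 0.2 0.3 0.4
   1.1 1.2 1.3 1.4
   2.1 2.2 2.3 2.4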

I've managed to read each sensor's data into a separate TensorFlow Dataset with the following code:

import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"

modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]

    dataset = tf.data.Dataset.from_tensor_slices((filenames,))


    def _parse_function_internal(filename):
        number_of_columns = 4
        single_observation = tf.read_file(filename)
        # Tokenise every value so we can cast these to floats later.
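        # Note: every character in sep ('\r', '\n', ' ') acts as a separate delimiter.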
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break

which successfully prints the contents of all three files for every sensor.

My problem is that some timestamps in the dataset might be missing. For instance, if the file 1.dat in the sensor_A directory is missing, I get this error:

tensorflow.python.framework.errors_impl.NotFoundError: NewRandomAccessFile failed to Create/Open: mock_data\sensor_A\1.dat : The system cannot find the file specified.
; No such file or directory
     [[{{node ReadFile}}]] [Op:IteratorGetNextSync]

which is thrown at this line:

for el in dataset:

What I've tried is surrounding the call to the tf.read_file() function with a try block, but obviously that doesn't work, as the error is not thrown when tf.read_file() is called but when the value is fetched from the dataset. Later I want to pass this dataset to a Keras model, so I can't just wrap the iteration in a try block. Is there any workaround? Is this even supported?

Thanks!

Tags: python, tensorflow, tensorflow-datasets

Solution


I managed to solve the problem, and I'm sharing the solution in case anyone else is struggling with it. I had to use an additional list of booleans specifying whether each file actually exists, and pass it to the mapper. Then, using the tf.cond() function, we decide whether to read the file or mock the data with zeroes (or with any other logic).

import os
import tensorflow as tf

tf.enable_eager_execution()

data_root_dir = "data"

modalities_to_use = ["sensor_A", "sensor_B"]
timestamps = [1, 2, 3]

for mod_idx, modality in enumerate(modalities_to_use):
    # Will produce: ['data/sensor_A/1.dat', 'data/sensor_A/2.dat', 'data/sensor_A/3.dat']
    filenames = [os.path.join(data_root_dir, modality, str(timestamp) + ".dat") for timestamp in timestamps]
    files_exist = [os.path.isfile(filename) for filename in filenames]

    dataset = tf.data.Dataset.from_tensor_slices((filenames, files_exist))


    def _parse_function_internal(filename, file_exist):
        number_of_columns = 4
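        # If the file is missing, fall back to a single row of zeroes ('0.0 0.0 0.0 0.0'),
        # which parses to a tensor of shape (1, 4).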
        single_observation = tf.cond(file_exist, lambda: tf.read_file(filename), lambda: ' '.join(['0.0'] * number_of_columns))
        # Tokenise every value so we can cast these to floats later.
        single_observation = tf.string_split([single_observation], sep='\r\n ').values
        single_observation = tf.reshape(single_observation, (-1, number_of_columns))
        single_observation = tf.strings.to_number(single_observation, tf.float32)
        return filename, single_observation

    dataset = dataset.map(_parse_function_internal)

    print('Result:')
    for el in dataset:
        try:
            # Filename
            print(el[0])
            # Parsed file content
            print(el[1])
        except tf.errors.OutOfRangeError:
            break
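
Since the parsed observations have a variable number of rows, they can't be batched directly before being fed to a Keras model. A minimal sketch of one way to handle this (assuming the same TF 1.x eager setup as above; batch_size=2 and the name batched_dataset are arbitrary choices) pads each batch to its longest observation:

batched_dataset = dataset.padded_batch(
    batch_size=2,
    padded_shapes=(tf.TensorShape([]),          # scalar filename string, no padding
                   tf.TensorShape([None, 4])))  # pad rows to the longest in the batch

for filenames_batch, observations_batch in batched_dataset:
    print(filenames_batch)
    print(observations_batch)

Conveniently, the default padding value for tf.float32 is 0.0, which matches the zero-fill used for missing files.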
