Using tfrecord but the file is too large

Problem description

I am trying to create a tfrecord from a folder of numpy arrays; the folder contains about 2000 numpy files of 50 MB each.

import sys
import numpy as np
import tensorflow as tf

def wrap_bytes(value):
    # Wrap raw bytes in a TensorFlow Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def print_progress(count, total):
    # Print the conversion progress as a percentage.
    sys.stdout.write("\r- Progress: {0:.1%}".format(count / total))
    sys.stdout.flush()

def convert(image_paths, out_path):
    # Args:
    # image_paths   List of file-paths for the images.
    # out_path      File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths.
        for i, path in enumerate(image_paths):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images - 1)
            # Load the image array from the .npy file.
            img = np.load(path)
            # Convert the image to raw bytes.
            img_bytes = img.tostring()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {'image': wrap_bytes(img_bytes)}
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)

It converts roughly 200 files and then I get this:

Converting: tf.recordtrain
- Progress: 3.6%Traceback (most recent call last):
  File "tf_record.py", line 71, in <module>
out_path=path_tfrecords_train)
  File "tf_record.py", line 54, in convert
writer.write(serialized)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/lib/io/tf_record.py", line 236, in write
self._writer.WriteRecord(record, status)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.OutOfRangeError: tf.recordtrain; File too large

Any suggestions for solving this would be helpful; thanks in advance.

Tags: tensorflow

Solution


I am not sure what the size limit on a tfrecords file is, but assuming you have enough disk space, the more common approach is to shard the dataset across multiple tfrecords files anyway, for example writing every 20 numpy files into a different tfrecords file.
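A minimal sketch of that sharding idea, assuming the `convert` function from the question is passed in as `convert_fn`; the chunking helper, the `files_per_shard` default, and the shard-naming pattern are all illustrative choices, not part of the original code:

```python
def chunks(items, n):
    # Yield successive n-sized chunks from a list of paths.
    for i in range(0, len(items), n):
        yield items[i:i + n]

def convert_sharded(image_paths, out_prefix, convert_fn, files_per_shard=20):
    # Write each group of files_per_shard numpy files into its own
    # TFRecords file, e.g. train-00000.tfrecord, train-00001.tfrecord, ...
    for shard_idx, shard_paths in enumerate(chunks(image_paths, files_per_shard)):
        out_path = "{}-{:05d}.tfrecord".format(out_prefix, shard_idx)
        convert_fn(shard_paths, out_path)
```

With ~2000 files of 50 MB each, 20 files per shard keeps every output file around 1 GB instead of a single ~100 GB record; readers like `tf.data.TFRecordDataset` accept a list of shard filenames, so nothing downstream needs to change beyond passing all the paths.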
