Combining two data generators to train a CNN

Problem description

I am trying to train a model on a dataset that I have split into two parts; for each part I create a separate ImageDataGenerator with Keras and TensorFlow.

My question is: how can I combine the data from my two generators to train the model? I don't want to use each one separately.

Thanks, everyone.

Tags: tensorflow, keras, dataset, generator, data-augmentation

Solution


You have split all of your data into two different directories, and now you want to train a model with the data from both.

You can achieve this in two ways:

  1. The flow_from_directory method of the Keras ImageDataGenerator has a follow_links argument. To use it, create a separate directory with whatever class structure you need and, inside it, create symbolic links to your original data directories. In the layout below you would use the Data directory as the main input directory; a minimal sketch follows the tree.

    .
    ├── Directory1/
    │   ├── Class1/
    │   └── Class2/
    ├── Directory2/
    │   ├── Class1/
    │   └── Class2/
    └── Data/
        ├── Class1/
        │   ├── symlink_to_Directory1_Class1
        │   └── symlink_to_Directory2_Class1
        └── Class2/
            ├── symlink_to_Directory1_Class2
            └── symlink_to_Directory2_Class2
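
    A minimal sketch of this approach (the paths, target size, and augmentation settings here are assumptions, not taken from the question): once the symlinks exist, a single flow_from_directory call on the Data directory with follow_links=True serves images from both original directories.

    import os
    from keras.preprocessing.image import ImageDataGenerator

    # Create the symlinks once, e.g. (repeat for every directory/class pair;
    # the names below mirror the tree above):
    # os.symlink(os.path.abspath("Directory1/Class1"),
    #            "Data/Class1/symlink_to_Directory1_Class1")

    datagen = ImageDataGenerator(rescale=1. / 255, horizontal_flip=True)

    # follow_links=True makes flow_from_directory descend into the symlinked
    # folders, so images from Directory1 and Directory2 are served together.
    train_generator = datagen.flow_from_directory(
        "Data",                      # the symlink-only merged directory
        target_size=(128, 128),
        class_mode="categorical",
        batch_size=32,
        shuffle=True,
        follow_links=True,
    )
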
    
    
  2. Create two ImageDataGenerators, one for each directory, and then merge them into a single generator. In this case the batch size of each sub-generator has to be set in proportion to the number of images in its directory:

    Batch size of a sub-generator:  b = \frac{B \times n}{\sum n}

    where
    b = batch size of the sub-generator,
    B = desired batch size of the merged generator,
    n = number of images in that sub-generator's directory,
    \sum n = total number of images across all directories.
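
    For example (the image counts here are made up purely for illustration): with 600 images in the first directory, 200 in the second, and a desired merged batch size of B = 32, the sub-batch sizes become b_1 = 32 \times 600 / 800 = 24 and b_2 = 32 \times 200 / 800 = 8, so every merged batch of 32 keeps the two sources in their original 3:1 proportion.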
    

    See the code below:

    from keras.preprocessing.image import ImageDataGenerator
    from keras.utils import Sequence
    import matplotlib.pyplot as plt
    import numpy as np
    import os
    
    
    class MergedGenerators(Sequence):
        """Wraps several generators and concatenates one sub-batch from
        each into a single merged batch of size `batch_size`."""

        def __init__(self, batch_size, generators=[], sub_batch_size=[]):
            self.generators = generators
            self.sub_batch_size = sub_batch_size
            self.batch_size = batch_size
    
        def __len__(self):
            # Number of merged batches: total images across all
            # sub-generators divided by the merged batch size.
            return int(
                sum([(len(self.generators[idx]) * self.sub_batch_size[idx])
                     for idx in range(len(self.sub_batch_size))]) /
                self.batch_size)
    
        def __getitem__(self, index):
            """Getting items from the generators and packing them"""
    
            X_batch = []
            Y_batch = []
            for generator in self.generators:
                if generator.class_mode is None:
                    x1 = generator[index % len(generator)]
                    X_batch = [*X_batch, *x1]
    
                else:
                    x1, y1 = generator[index % len(generator)]
                    X_batch = [*X_batch, *x1]
                    Y_batch = [*Y_batch, *y1]
    
            if self.generators[0].class_mode is None:
                return np.array(X_batch)
            return np.array(X_batch), np.array(Y_batch)
    
    
    def build_datagenerator(dir1=None, dir2=None, batch_size=32):
        n_images_in_dir1 = sum([len(files) for r, d, files in os.walk(dir1)])
        n_images_in_dir2 = sum([len(files) for r, d, files in os.walk(dir2)])
    
        # The two generators need different batch sizes because the two
        # directories contain different numbers of images; each directory's
        # share of a merged batch is proportional to its size.
        generator1_batch_size = int((n_images_in_dir1 * batch_size) /
                                    (n_images_in_dir1 + n_images_in_dir2))
    
        generator2_batch_size = batch_size - generator1_batch_size
    
        generator1 = ImageDataGenerator(
            rescale=1. / 255,
            shear_range=0.2,
            zoom_range=0.2,
            rotation_range=5.,
            horizontal_flip=True,
        )
    
        generator2 = ImageDataGenerator(
            rescale=1. / 255,
            zoom_range=0.2,
            horizontal_flip=False,
        )
    
        # generator2 has different image augmentation attributes than generator1
        generator1 = generator1.flow_from_directory(
            dir1,
            target_size=(128, 128),
            color_mode='rgb',
            class_mode=None,
            batch_size=generator1_batch_size,
            shuffle=True,
            seed=42,
            interpolation="bicubic",
        )
    
        generator2 = generator2.flow_from_directory(
            dir2,
            target_size=(128, 128),
            color_mode='rgb',
            class_mode=None,
            batch_size=generator2_batch_size,
            shuffle=True,
            seed=42,
            interpolation="bicubic",
        )
    
        return MergedGenerators(
            batch_size,
            generators=[generator1, generator2],
            sub_batch_size=[generator1_batch_size, generator2_batch_size])
    
    
    def test_datagen(batch_size=32):
        datagen = build_datagenerator(dir1="./asdf",
                                      dir2="./asdf2",
                                      batch_size=batch_size)
    
        print("Datagenerator length (Batch count):", len(datagen))
    
        for batch_count, image_batch in enumerate(datagen):
            if batch_count == 1:
                break
    
            print("Images: ", image_batch.shape)
    
            plt.figure(figsize=(10, 10))
            for i in range(image_batch.shape[0]):
                plt.subplot(1, batch_size, i + 1)
                plt.imshow(image_batch[i], interpolation='nearest')
                plt.axis('off')
            plt.tight_layout()
            plt.show()
    
    
    test_datagen(4)
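
    Since MergedGenerators is a keras.utils.Sequence, it can be passed to model.fit directly in recent Keras/TensorFlow versions (older versions use fit_generator). The sketch below assumes a small hypothetical classifier, and that build_datagenerator is changed to use class_mode='categorical' in both sub-generators so the merged batches include labels; the directory paths are the same placeholders as above.

    from keras import layers, models

    model = models.Sequential([
        layers.Conv2D(32, 3, activation='relu', input_shape=(128, 128, 3)),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(2, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])

    train_gen = build_datagenerator(dir1="./asdf",
                                    dir2="./asdf2",
                                    batch_size=32)

    # Keras calls __len__ and __getitem__ on the Sequence each epoch.
    model.fit(train_gen, epochs=10)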
    
    
