Why can't I split my image dataset 8:1:1?

Problem Description

I'm trying to split my dataset 8:1:1, and my dataset is in a single directory. At first I tried this code:

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  dir,
  validation_split=0.1,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

but it didn't do the job. After that, to split my directory into val_ds and test_ds, I used this code:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# create a data generator
datagen = ImageDataGenerator()
# load and iterate training dataset
train_it = datagen.flow_from_directory(dir, target_size=(32, 32), color_mode='grayscale',
                                       class_mode='binary', batch_size=32, shuffle=True,
                                       follow_links=False, subset=None, interpolation='nearest')
# load and iterate validation dataset
val_it = datagen.flow_from_directory(dir, target_size=(32, 32), color_mode='grayscale',
                                     class_mode='binary', batch_size=32, shuffle=True,
                                     follow_links=False, subset=None, interpolation='nearest')
# load and iterate test dataset
test_it = datagen.flow_from_directory(dir, target_size=(32, 32), color_mode='grayscale',
                                      class_mode='binary', batch_size=32, shuffle=True,
                                      follow_links=False, subset=None, interpolation='nearest')

This code also caused problems with my model, so when I use this code my model summary looks like this:

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
rescaling_1 (Rescaling)      (None, None, None, None)  0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, None, None, 32)    320       
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, None, None, 32)    0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, None, None, 32)    9248      
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, None, None, 32)    0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, None, None, 32)    9248      
_________________________________________________________________
max_pooling2d_5 (MaxPooling2 (None, None, None, 32)    0         
_________________________________________________________________
dropout_1 (Dropout)          (None, None, None, 32)    0         
_________________________________________________________________
flatten_1 (Flatten)          (None, None)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 128)               16512     
_________________________________________________________________
dense_3 (Dense)              (None, 26)                3354      
=================================================================
Total params: 38,682
Trainable params: 38,682
Non-trainable params: 0
_________________________________________________________________

Here is my model:

num_classes = 26

model = tf.keras.Sequential([
  tf.keras.layers.experimental.preprocessing.Rescaling(1./255),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Conv2D(32, 3, activation='relu'),
  tf.keras.layers.MaxPooling2D(),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(num_classes)
])
model.compile(
  optimizer='adam',
  loss=tf.losses.SparseCategoricalCrossentropy(from_logits=True),
  metrics=['accuracy'])

So I need to know how to split my data without any problems.

Tags: tensorflow, machine-learning, keras, deep-learning, conv-neural-network

Solution

You can do the following:

import glob # To get the full paths of all the images
'''
Let's say your images lie inside two folders, 'a' and 'b', which are inside your 'dir'
folder. To get the paths to each of those images you can use the code below.
'''
image_paths_a = glob.glob('./dir/a/*.jpg') # use *.jpg if the files end with .jpg
image_paths_b = glob.glob('./dir/b/*.jpg') # to get the images from b
images_total = image_paths_a + image_paths_b

# In case you have other folders you can also do this
# to get all images inside all folder in 'dir' folder.
images_total = glob.glob('./dir/*/*.jpg') 

# Now get the labels corresponding to these images.
# If your labels are the folder names, you can get them like this
image_labels = [i.split('/')[-2] for i in images_total] # assumes '/'-separated paths
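
'''
Note (my addition, not part of the original answer): the folder names above are strings,
while the model in the question is compiled with SparseCategoricalCrossentropy, which
expects integer class indices. Below is a minimal sketch of mapping each label to an
index; the names 'class_names', 'label_to_index' and 'image_labels_idx' are my own, and
the mapping simply uses the sorted folder names.
'''
import os

# Deterministic class-name -> integer-index mapping built from the folder names.
class_names = sorted({os.path.basename(os.path.dirname(p)) for p in images_total})
label_to_index = {name: idx for idx, name in enumerate(class_names)}

# Integer labels that SparseCategoricalCrossentropy can consume directly.
# You could pass these instead of 'image_labels' to the splits below.
image_labels_idx = [label_to_index[os.path.basename(os.path.dirname(p))]
                    for p in images_total]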

'''
After doing the above you have two lists: 1) the image paths and 2) the corresponding
labels. Now you can just use 'sklearn.model_selection.train_test_split' to get your splits.
'''
from sklearn.model_selection import train_test_split

# Keep 80% for training and set aside the remaining 20% for a further split
xtrain, xtest, ytrain, ytest = train_test_split(images_total,
                                                image_labels,
                                                stratify=image_labels,
                                                random_state=1234,
                                                test_size=0.2)

# Split that 20% in half to get 10% validation and 10% test of the original data
xvalid, xtest, yvalid, ytest = train_test_split(xtest,
                                                ytest,
                                                stratify=ytest,
                                                random_state=1234,
                                                test_size=0.5)
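
'''
Optional sanity check (my addition, not part of the original answer): print the size of
each split to confirm the proportions come out to roughly 8:1:1. The exact counts depend
on your dataset size and on how stratification rounds the per-class counts.
'''
total = len(images_total)
for name, split in [('train', xtrain), ('valid', xvalid), ('test', xtest)]:
    print(f'{name}: {len(split)} images ({len(split) / total:.1%})')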

'''
Now you can create the datasets, but first define a function that reads the images
from the image paths.
'''
def read_img(path, label):
  file = tf.io.read_file(path)
  # decode_jpeg matches the '*.jpg' globs above; use tf.image.decode_png for PNG files
  img = tf.image.decode_jpeg(file, channels=3)
  # dim1 and dim2 are your desired dimensions
  img = tf.image.resize(img, (dim1, dim2))
  return img, label

train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
train_dataset = train_dataset.map(read_img).batch(batch_size)

valid_dataset = tf.data.Dataset.from_tensor_slices((xvalid, yvalid))
valid_dataset = valid_dataset.map(read_img).batch(batch_size)

test_dataset = tf.data.Dataset.from_tensor_slices((xtest, ytest))
test_dataset = test_dataset.map(read_img).batch(batch_size)
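
'''
Optional (my addition, not part of the original answer): shuffling before batching and
prefetching usually gives a better input pipeline. tf.data.AUTOTUNE is available in
TF 2.4+; on older TF 2.x use tf.data.experimental.AUTOTUNE instead.
'''
AUTOTUNE = tf.data.AUTOTUNE

# Rebuild the training pipeline with shuffling before batching; the buffer size of 1000
# is an arbitrary choice here.
train_dataset = (tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
                 .shuffle(buffer_size=1000)
                 .map(read_img, num_parallel_calls=AUTOTUNE)
                 .batch(batch_size)
                 .prefetch(AUTOTUNE))

valid_dataset = valid_dataset.prefetch(AUTOTUNE)
test_dataset = test_dataset.prefetch(AUTOTUNE)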

# Now you just need to train your model
model.fit(train_dataset, epochs=5, validation_data=valid_dataset)
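
'''
After training, the held-out 10% test split can be evaluated in the usual Keras way;
model.evaluate returns the loss and the metrics passed to model.compile (here, accuracy).
'''
test_loss, test_acc = model.evaluate(test_dataset)
print(f'Test accuracy: {test_acc:.3f}')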
