Saving pixel values from images in a folder with os.walk() is extremely slow

Problem description

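I am trying to create a file named data.pickle that contains the pixel values of every image in a folder. The format of data.pickle is similar to the MNIST dataset, i.e. a tuple of 2 NumPy arrays.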
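To make the target format concrete, here is a minimal sketch of writing and reading a tuple of two NumPy arrays with pickle. The array shapes are tiny hypothetical stand-ins for the real images, and an in-memory buffer is used in place of the actual data.pickle file.

```python
import io
import pickle

import numpy as np

# Hypothetical stand-ins for the real data: 4 colour images and 4 ground truths.
images = np.zeros((4, 8, 8, 3), dtype=np.uint8)
groundtruths = np.ones((4, 8, 8, 3), dtype=np.uint8)

# Serialize both arrays as a single tuple, as described in the question.
buffer = io.BytesIO()  # in-memory stand-in for open('data.pickle', 'wb')
pickle.dump((images, groundtruths), buffer, protocol=pickle.HIGHEST_PROTOCOL)

# Loading returns the same tuple, which unpacks into the two arrays.
buffer.seek(0)
loaded_images, loaded_groundtruths = pickle.load(buffer)
print(loaded_images.shape, loaded_groundtruths.shape)
```

Both arrays must fit in memory at once for this format to work, both when saving and when loading.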

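First, I tested the code on a single folder containing about 1000 JPG images (about 15 KB each) and 1000 PNG images (about 1.3 KB each). The code ran fine the first time, but if I delete data.pickle and run it again, it runs much more slowly. I am new to Python, and I wonder whether Python keeps the data in memory even after execution has finished. That would not make sense.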

Then, when I use my whole dataset (10 times larger than the data above), the code no longer finishes (the run was killed after 1 day).
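The file sizes quoted in the question are compressed sizes. Once decoded, each image occupies height × width × channels bytes in memory. As a rough sketch of why the full dataset might exhaust RAM, the arithmetic below assumes a hypothetical 320×240 resolution (the question does not state the real one) and 3 channels per image, which is what cv2.imread returns by default. The helper name decoded_bytes is made up for illustration.

```python
def decoded_bytes(height, width, channels=3, bytes_per_value=1):
    """Size in bytes of one decoded 8-bit image array."""
    return height * width * channels * bytes_per_value


# Assumed resolution; the question does not give the real one.
HEIGHT, WIDTH = 240, 320

per_image = decoded_bytes(HEIGHT, WIDTH)
pairs_per_category = 1000  # from the question
categories = 10            # from the question

# Each pair holds one input image and one ground-truth image.
total = per_image * 2 * pairs_per_category * categories

print(f'{per_image / 1024:.0f} KiB per decoded image')
print(f'{total / 1024**3:.2f} GiB for the whole dataset')
```

Under these assumptions the arrays alone would need several GiB, and np.asarray builds a second copy while the original lists are still alive. Swapping to disk is therefore one plausible, unconfirmed explanation for the slowdown.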

Could you point out what I am doing wrong or suggest a better approach? I would really appreciate your help.

Thank you, and have a nice day.

Here is my code:

import os
import pickle #to save tuple
import cv2
import numpy as np

PATH_DATA = 'data'
#data folder structure:
#data
#    |
#    category1
#    |       |
#    |       input 
#    |       |   |
#    |       |   in0000001.jpg... #1000 images
#    |       |
#    |       groundtruth
#    |           |
#    |           gt0000001.png... #1000 images
#    category2... #10 categories

def add_slash(path_list): #add a slash between every path
    return '/'.join(path_list)

images = [] #list of train images
groundtruths = [] #list of train labels

for path_in_data, dirs_in_data, files_in_data in os.walk(PATH_DATA): #get category = dirs list
    for category in dirs_in_data: #loop through every category
        for path_in_category, dirs_in_category, files_in_category in os.walk(add_slash([path_in_data, category])): #get dirs in category
            for dir_ in dirs_in_category: #proceed into groundtruth folder
                if dir_ == 'groundtruth': #only process groundtruth folder
                    for path_in_groundtruth, dirs_in_groundtruth, files_in_groundtruth in os.walk(add_slash([path_in_category, dir_])):
                        #loop through each images
                        files_in_groundtruth.sort() #to ensure the data obtained is the same every execution
                        for file_ in files_in_groundtruth:
                            groundtruth = cv2.imread(add_slash([path_in_groundtruth, file_]))
                            foreground = cv2.imread(add_slash([path_in_groundtruth.replace('groundtruth', 'input'), file_.replace('gt','in').replace('png','jpg')]))
                            #append image and groundtruth to list
                            images.append(foreground)
                            groundtruths.append(groundtruth)
                        break #only process top directory in groundtruth
                else:
                    continue #skip input folder, only process groundtruth folder
            break #only process the top directory in category
    images = np.asarray(images) #convert list to numpy array
    groundtruths = np.asarray(groundtruths)

    #save as a tuple for use
    print(f'Saving dataset to {path_in_data}')
    with open(f'{path_in_data}/data.pickle', 'wb') as f:
        pickle.dump((images, groundtruths), f)
    break #only process the top directory in data

Tags: python, opencv, deep-learning

Solution
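
One possible simplification, offered as a sketch rather than a confirmed fix for the slowdown: the three nested os.walk loops only ever visit data/<category>/groundtruth, so a single glob pattern can list the same files. The function name collect_pairs is hypothetical, and the layout (groundtruth/gtNNN.png paired with input/inNNN.jpg) is inferred from the question's code.

```python
from pathlib import Path


def collect_pairs(data_root):
    """Return sorted (input_path, groundtruth_path) pairs under data_root.

    Assumes the layout data_root/<category>/groundtruth/gtNNN.png with a
    matching data_root/<category>/input/inNNN.jpg, as in the question.
    """
    pairs = []
    # One glob replaces the three nested os.walk loops.
    for gt_path in sorted(Path(data_root).glob('*/groundtruth/*.png')):
        # gt0000001.png -> in0000001.jpg
        name = gt_path.name.replace('gt', 'in', 1)
        in_path = gt_path.parent.parent / 'input' / Path(name).with_suffix('.jpg')
        if in_path.is_file():  # skip ground truths with no matching input
            pairs.append((in_path, gt_path))
    return pairs


if __name__ == '__main__':
    for in_path, gt_path in collect_pairs('data'):
        print(in_path, gt_path)
```

Unlike the original code, this sketch skips ground-truth files that have no matching input image, instead of appending whatever cv2.imread returns for a missing file. Each pair could then be read with cv2.imread one at a time, which keeps the file listing separate from the memory-heavy loading step.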

