Efficient recursive drive scanning in Python

Problem description

Original question:

I am trying to scan a directory recursively with the code below, to get the total space used on a disk plus details for every file and folder. The code works, but I need advice on making it efficient enough to scan drives holding 200 GB or more of data. Test results on a disk with 5.49 GB used (244,169 files and 34,253 folders):

  1. If the code runs without the list-append operations, scanning the disk takes about 8 minutes, which is not efficient.
  2. If I include the list-append statements it gets much worse, taking about 25 minutes --> the bottleneck.
import os
import sys
import logging
from datetime import datetime

def scanSlot(path):
    """Return total size of files in given path and subdirs."""
    global dir_list
    global path_list
    try:
        dir_list = os.scandir(path)
    except OSError:
        logging.info(">>> Access Denied for " + path)
        dir_list = []

    tot_size = 0
    for file in dir_list:
        file_stat = file.stat()
        time_stat = os.stat(file.path)

        # If the entry is a directory, recursively call the function again
        if file.is_dir(follow_symlinks=False):

            # Recursive call
            sub_size = scanSlot(file.path)

            # Accumulate size
            tot_size += sub_size

            # logging.info('Dirname:' + str(file) + ' Path' + str(file.path) + ' Size' + str(sub_size))

            # List append
            path_list.append((file.name, file.path, file_stat.st_mode, file_stat.st_ino,
                              file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                              file_stat.st_gid, tot_size,
                              datetime.fromtimestamp(time_stat.st_atime),
                              datetime.fromtimestamp(time_stat.st_mtime),
                              datetime.fromtimestamp(time_stat.st_ctime), "dir"))

        # If the entry is a regular file, retrieve all its details
        if file.is_file(follow_symlinks=False):

            # logging.info('Filename:' + str(file) + ' Path' + str(file.path) + ' Size' + str(time_stat))

            tot_size += file.stat(follow_symlinks=False).st_size

            # List append
            path_list.append((file.name, file.path, file_stat.st_mode, file_stat.st_ino,
                              file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                              file_stat.st_gid, file_stat.st_size,
                              datetime.fromtimestamp(time_stat.st_atime),
                              datetime.fromtimestamp(time_stat.st_mtime),
                              datetime.fromtimestamp(time_stat.st_ctime), "file"))
    return tot_size
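One avoidable cost in the version above is that every entry is stat'ed twice (`file.stat()` plus a separate `os.stat(file.path)`, and files a third time via `file.stat(follow_symlinks=False)`). `os.scandir` caches stat results on the `DirEntry`, so a single `stat(follow_symlinks=False)` call per entry is enough. This is a minimal sketch of that idea, not the author's code; the function name and the reduced tuple layout are illustrative:

```python
import os
from datetime import datetime

def scan(path, rows):
    """Recursively sum file sizes under path, stat'ing each entry once."""
    total = 0
    try:
        entries = list(os.scandir(path))
    except OSError:
        return 0  # access denied, vanished directory, etc.
    for entry in entries:
        st = entry.stat(follow_symlinks=False)  # cached on the DirEntry
        if entry.is_dir(follow_symlinks=False):
            size = scan(entry.path, rows)
        elif entry.is_file(follow_symlinks=False):
            size = st.st_size
        else:
            continue  # skip symlinks and special files
        total += size
        rows.append((entry.name, entry.path, size,
                     datetime.fromtimestamp(st.st_mtime)))
    return total
```

Passing the result list as a parameter instead of a global also keeps the function usable for scanning several roots independently.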

The function call for the code above:

server_size = scanSlot('D:\\New folder')

I have tried to optimize the code with:

  1. The Python library numba, which does not work because numba has no implementation of the os library being used.
  2. Converting the code to Cython, but I am not sure whether that would help.

The list-append operations cannot be dropped, because further analysis needs the details collected in that path list.

Update:

Following @triplee's suggestion, and with the help of the implementation linked here, I have rewritten the directory scan using os.walk(), and it is clearly faster (19.6 GB, 275,559 files and 38,592 folders scanned in 20 minutes while logging every file and directory to a log file). The code is below.

FYI: still testing.

def scanSlot(path):
    total_size = 0
    global path_list
    for dirpath, dirnames, filenames in os.walk(path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is a symbolic link
            if not os.path.islink(fp):
                file_stat = os.stat(fp)
                # logging.info('Filename:' + str(fp) + ' Size:' + str(file_stat.st_size))
                path_list.append((f, fp, file_stat.st_mode, file_stat.st_ino,
                                  file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                                  file_stat.st_gid, file_stat.st_size,
                                  datetime.fromtimestamp(file_stat.st_atime),
                                  datetime.fromtimestamp(file_stat.st_mtime),
                                  datetime.fromtimestamp(file_stat.st_ctime), "file"))
                # st_size equals what os.path.getsize(fp) would return, but
                # reusing the stat result avoids two extra system calls
                total_size += file_stat.st_size

    return total_size

if __name__ == "__main__":
    path_list = []
    logging.info('>>> Start:' + str(datetime.now().time()))

    # Run os.walk() over the drive, then do the same for every directory in it
    for dirpath, dirnames, filenames in os.walk('F:\\'):

        for f in dirnames:
            fp = os.path.join(dirpath, f)
            dir_size = scanSlot(fp)
            file_stat = os.stat(fp)
            logging.info('Dirname:' + str(fp) + ' Size:' + str(dir_size))
            path_list.append((f, fp, file_stat.st_mode, file_stat.st_ino,
                              file_stat.st_dev, file_stat.st_nlink, file_stat.st_uid,
                              file_stat.st_gid, dir_size,
                              datetime.fromtimestamp(file_stat.st_atime),
                              datetime.fromtimestamp(file_stat.st_mtime),
                              datetime.fromtimestamp(file_stat.st_ctime), "dir"))

    logging.info('>>> End:' + str(datetime.now().time()))
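Note that this driver calls scanSlot() for every directory at every level, so each subtree is re-walked once per ancestor, which is quadratic work on deep trees. A single bottom-up traversal with os.walk(topdown=False) can compute every directory's size in one pass; the sketch below shows only the size calculation, not the tuple/logging layout above, and the function name is illustrative:

```python
import os

def dir_sizes(root):
    """One bottom-up walk: return {dirpath: total size of its subtree}."""
    sizes = {}
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        total = 0
        for f in filenames:
            fp = os.path.join(dirpath, f)
            if not os.path.islink(fp):  # skip symbolic links
                total += os.stat(fp).st_size
        for d in dirnames:
            # with topdown=False the children were already visited,
            # so their totals are available in sizes
            total += sizes.get(os.path.join(dirpath, d), 0)
        sizes[dirpath] = total
    return sizes
```

With this, each file is stat'ed exactly once, and every directory total (including the root's) comes out of the same traversal.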

Further questions:

References:

An explanation of follow_symlinks can also be found in the reference links above.

Tags: python, recursion, directory, cython, python-multiprocessing
