What is a faster, more memory-efficient way to read a subset of files from a directory based on a date pattern in the filenames?

Problem description

The code I currently have:

import glob
import pandas as pd

cols = ['X','Y','Z','W','A']
path = r'/Desktop/files'
all_files = glob.glob(path + "/file*")
d_list = pd.date_range('2019-09-01','2020-09-09',freq='D').strftime("%Y-%m-%d").tolist()

list1 = []

# For every date, scan every filename: O(len(d_list) * len(all_files)) checks.
for i in d_list:
    for filename in all_files:
        if i in filename:
            df = pd.read_csv(filename, sep='|', usecols=cols)
            list1.append(df)

data = pd.concat(list1, axis=0, ignore_index=True)

This code takes a very long time to run, and I assume I don't have enough memory. Is there another way to make it faster? If anyone knows how I could use dask.dataframe, whether that would help, and how to preserve the original dtypes of the variables, please let me know.

Thanks!

Tags: python, pandas, loops, csv, dask

Solution


Try the following with dask:

import glob
import dask.dataframe as dd

cols = ['X','Y','Z','W','A']

# Use a glob pattern that matches the dates you want, so you scan the
# files once rather than looping through a list of dates repeatedly.
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')

df = dd.concat([dd.read_csv(f, sep='|', usecols=cols) for f in all_files])
# df1 = df.compute()  # returns a pandas DataFrame from the dask DataFrame

The pandas syntax is essentially the same:

import glob
import pandas as pd

cols = ['X','Y','Z','W','A']
all_files = glob.glob(r'/Desktop/files/file*2019-09-0*.csv')
df = pd.concat([pd.read_csv(f, sep='|', usecols=cols) for f in all_files])
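When the date range is too irregular for a single glob pattern (the original question spans 2019-09-01 through 2020-09-09), you can still avoid the nested loop by putting the dates in a `set` and filtering each filename once with O(1) membership tests. A self-contained sketch, where the sample files and the date regex are illustrative assumptions:

```python
import glob
import os
import re
import tempfile
import pandas as pd

cols = ['X', 'Y']

# Sample files so the sketch runs on its own; point the glob at your
# own directory in practice.
tmpdir = tempfile.mkdtemp()
for name in ('file_2019-09-01.csv', 'file_2020-12-31.csv'):
    with open(os.path.join(tmpdir, name), 'w') as fh:
        fh.write('X|Y\n1|2\n')

# Build the date list once as a set: each filename is then examined a
# single time, instead of once per date as in the nested loop.
wanted = set(pd.date_range('2019-09-01', '2020-09-09',
                           freq='D').strftime('%Y-%m-%d'))
date_re = re.compile(r'\d{4}-\d{2}-\d{2}')

all_files = glob.glob(os.path.join(tmpdir, 'file*'))
selected = [f for f in all_files
            if (m := date_re.search(os.path.basename(f)))
            and m.group() in wanted]

df = pd.concat((pd.read_csv(f, sep='|', usecols=cols) for f in selected),
               ignore_index=True)
```

This keeps the single pass over the files that the glob-pattern version has, while still honoring an arbitrary list of dates.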
