Read only the Excel sheet_names containing a certain word into a pandas DataFrame

Question

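I have a lot of reports that I need to compile into a single DataFrame in Python.

This code loops over my directory and reads every report file, using the same sheet name in each file. Each workbook has many sheets, but I only want the sheet_names that contain a particular string, 'Report'.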

import pandas as pd
from pathlib import Path
import os
import glob

pathsting= 'path/to/working/directory'
rootdir = Path(pathsting)
onlydirs = [f for f in os.listdir(rootdir) if os.path.isdir(os.path.join(rootdir, f))]

df0 = pd.DataFrame()
for direct in onlydirs:
    print(direct)
    dirpath = rootdir / direct  # portable join instead of a hard-coded '\\'
    for f in dirpath.glob("*Report.xlsm"):
        print(f.name)
        temp = pd.read_excel(f, sheet_name='Report')
        df0 = pd.concat([df0, temp])
display(df0)

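Now suppose the report format changes over time, so that sheet_name='Report' becomes sheet_name='XYZ Report'. I have many reports, and the name has changed several times. I don't want to hard-code every possible report name in several different loops.

I can use glob to read every file ending in "Report.xlsm", but is there a similar way to read the sheet_names that contain the text "Report" rather than match an exact string?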

Tags: python, excel, pandas, dataframe

Answer


Try:

import pandas as pd
import glob
import re

path = r'./files' # use your path
all_files = glob.glob(path + "/*.xlsm")

# case-insensitive pattern for names like blahReportblah or fooreportingss etc.
# It is applied to both the file path and the sheet names below.  Modify as required.
pattern = r'(?i)(.*report.*)'

# create empty list to hold dataframes from sheets found
dfs = []

# for each file in the path above ending .xlsm
for file in all_files:
    #if the file path has the word 'report' or even 'rEpOrTs' in it (a matching folder name in the path counts too)
    if re.search(pattern, file):
        #open the file
        ex_file = pd.ExcelFile(file)
        #then for each sheet in that file
        for sheet in ex_file.sheet_names:
            #check if the sheet has 'RePORting' etc. in it
            if re.search(pattern, sheet):
                #if so create a dataframe (maybe parse_dates isn't required).  Tweak as required
                df = ex_file.parse(sheet, parse_dates=True)
                #add this new (temp during the looping) frame to the end of the list
                dfs.append(df)

#handle a list that is empty
if len(dfs) == 0:
    print('No file or sheets found.')
    #create a dummy frame
    df = pd.DataFrame()
#or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
#or concatenate more than one frame together
else:
    # ignore_index=True already gives a fresh 0..n-1 index, so no reset_index is needed
    df = pd.concat(dfs, ignore_index=True)

#check what you've got
print(df.head())
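
For the layout in the question (workbooks sitting in subfolders, each with one sheet whose name contains "Report"), the same idea can be packed into two small functions. This is only a sketch under some assumptions: the names select_sheets and load_matching_sheets are made up for illustration, a plain case-insensitive substring test stands in for the regex above, and reading .xlsm files assumes openpyxl is installed.

```python
from pathlib import Path


def select_sheets(sheet_names, keyword="report"):
    """Return the sheet names that contain `keyword`, ignoring case."""
    needle = keyword.casefold()
    return [name for name in sheet_names if needle in name.casefold()]


def load_matching_sheets(root, file_glob="*Report.xlsm", keyword="report"):
    """Concatenate every matching sheet from every matching workbook under `root`."""
    # Imported here so that select_sheets can be used without pandas installed.
    import pandas as pd

    frames = []
    # rglob also descends into subdirectories, replacing the nested directory loops.
    for workbook in sorted(Path(root).rglob(file_glob)):
        # Open each workbook once, then parse only the sheets whose names match.
        with pd.ExcelFile(workbook) as xls:
            for sheet in select_sheets(xls.sheet_names, keyword):
                frames.append(xls.parse(sheet))
    # pd.concat raises on an empty list, so fall back to an empty frame.
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# The name filter on its own, with some made-up sheet names:
print(select_sheets(["Report", "XYZ Report", "Summary", "monthly reporting"]))
# prints ['Report', 'XYZ Report', 'monthly reporting']
```

With the question's directory this would be called as df0 = load_matching_sheets('path/to/working/directory'). If a workbook could hold several sheets with "Report" in the name and only one is wanted, tighten the keyword or take the first element of the list that select_sheets returns.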
