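python - Read only the Excel sheet_names that contain a certain word into a pandas DataFrame
Problem description
I have many reports that I need to compile into a single DataFrame in Python.
This code loops through my directory and reads every report file, using the same worksheet name in each file. Each workbook has many worksheets, but I only want the sheet_names that contain a specific string, 'Report'.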
import pandas as pd
from pathlib import Path
import os
import glob
pathsting= 'path/to/working/directory'
rootdir = Path(pathsting)
onlydirs = [f for f in os.listdir(rootdir) if os.path.isdir(os.path.join(rootdir, f))]
df0 = pd.DataFrame()
for direct in onlydirs:
    print(direct)
    dirpathstring = pathsting + '\\' + direct
    dirpath = Path(dirpathstring)
    onlyfiles = [f for f in os.listdir(dirpath) if os.path.isfile(os.path.join(dirpath, f))]
    for f in dirpath.glob("*Report.xlsm"):
        print(f.name)
        temp = pd.read_excel(f, sheet_name='Report')
        df0 = pd.concat([df0, temp])
display(df0)
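Now suppose that over time the report format changes, so that sheet_name='Report' becomes sheet_name='XYZ Report'. I have many reports, and the name has changed several times. I don't want to hard-code every possible report name in several different loops.
I can use glob to read all files ending in "Report.xlsm", but is there a similar way to read the sheet_names that contain the text "Report" rather than match the exact string?
Solution
Try: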
import pandas as pd
import glob
import os
import re
path = r'./files' # use your path
all_files = glob.glob(path + "/*.xlsm")
# case-insensitive pattern for file or sheet names like blahReportblah or fooreportingss etc. Modify as required.
pattern = r'(?i)(.*report.*)'
# create empty list to hold dataframes from sheets found
dfs = []
# for each file in the path above ending .xlsm
for file in all_files:
    # if the file name (not the folder part of the path) has the word 'report' or even 'rEpOrTs' in it
    if re.search(pattern, os.path.basename(file)):
        # open the file
        ex_file = pd.ExcelFile(file)
        # then for each sheet in that file
        for sheet in ex_file.sheet_names:
            # check if the sheet has 'RePORting' etc. in it
            if re.search(pattern, sheet):
                # if so create a dataframe (maybe parse_dates isn't required). Tweak as required
                df = ex_file.parse(sheet, parse_dates=True)
                # add this new (temp during the looping) frame to the end of the list
                dfs.append(df)
            else:
                # if sheet doesn't have the word 'report' move on, nothing to see here
                continue
    else:
        # if file doesn't have the word 'report' move on, nothing to see here
        continue
# handle a list that is empty
if len(dfs) == 0:
    print('No file or sheets found.')
    # create a dummy frame
    df = pd.DataFrame()
# or have only one item/frame and get it out
elif len(dfs) == 1:
    df = dfs[0]
# or concatenate more than one frame together
else:
    df = pd.concat(dfs, ignore_index=True)
df = df.reset_index(drop=True)
# check what you've got
print(df.head())
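Since the question asks for something glob-like for sheet names, here is a minimal sketch of the same filtering step using the standard library's fnmatch module instead of a regular expression. The helper name matching_sheets and the sample sheet names are made up for illustration; they do not come from the original post.

```python
from fnmatch import fnmatchcase


def matching_sheets(sheet_names, pattern="*report*"):
    """Return the sheet names matching a glob-style pattern, ignoring case.

    Both the name and the pattern are lower-cased before the comparison,
    so 'XYZ Report' and 'reporting 2020' both match '*report*'.
    """
    wanted = pattern.lower()
    return [name for name in sheet_names if fnmatchcase(name.lower(), wanted)]


# Hypothetical sheet names, standing in for ex_file.sheet_names
names = ["Summary", "Report", "XYZ Report", "reporting 2020", "Data"]
print(matching_sheets(names))  # ['Report', 'XYZ Report', 'reporting 2020']
```

In the loop above, the list returned by matching_sheets(ex_file.sheet_names) could replace the re.search check on each sheet. Passing a list of names as sheet_name to pd.read_excel should also work, in which case it returns a dict of DataFrames keyed by sheet name rather than a single frame.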