python - 在 pandas 数据帧上迭代函数的最快方法
问题描述
我有一个函数可以对 csv 文件的行进行操作,根据是否满足条件将不同单元格的值添加到字典中:
df = pd.concat([pd.read_csv(filename) for filename in args.csv], ignore_index = True)
ID_Use_Totals = {}
ID_Order_Dates = {}
ID_Received_Dates = {}
ID_Refs = {}
IDs = args.ID
def TSQs(row):
global ID_Use_Totals, ID_Order_Dates, ID_Received_Dates
if row['Stock Item'] not in IDs:
pass
else:
if row['Action'] in ['Order/Resupply', 'Cons. Purchase']:
if row['Stock Item'] not in ID_Order_Dates:
ID_Order_Dates[row['Stock Item']] = [{row['Ref']: pd.to_datetime(row['TransDate'])}]
else:
ID_Order_Dates[row['Stock Item']].append({row['Ref']: pd.to_datetime(row['TransDate'])})
elif row['Action'] == 'Received':
if row['Stock Item'] not in ID_Received_Dates:
ID_Received_Dates[row['Stock Item']] = [{row['Ref']: pd.to_datetime(row['TransDate'])}]
else:
ID_Received_Dates[row['Stock Item']].append({row['Ref']: pd.to_datetime(row['TransDate'])})
elif row['Action'] == 'Use':
if row['Stock Item'] in ID_Use_Totals:
ID_Use_Totals[row['Stock Item']].append(row['Qty'])
else:
ID_Use_Totals[row['Stock Item']] = [row['Qty']]
else:
pass
目前,我正在做:
for index, row in df.iterrows():
TSQs(row)
但是timer()
对于 40,000 行的 csv 文件,返回 70 到 90 秒。
我想知道在整个数据帧(可能是数十万行)上实现这一点的最快方法是什么。
解决方案
我敢打赌不使用 Pandas 可能会更快。
此外,您可以使用defaultdict
s 来避免检查您是否已经看过给定的产品:
import csv
import collections
import datetime
ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)
ID_Refs = {}
IDs = set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}
for filename in args.csv:
with open(filename) as f:
for row in csv.DictReader(f):
item = row["Stock Item"]
if item not in IDs:
continue
ref = row["Ref"]
action = row["Action"]
if action in order_actions:
date = datetime.datetime.fromisoformat(row["TransDate"])
ID_Order_Dates[item].append({ref: date})
elif action == "Received":
date = datetime.datetime.fromisoformat(row["TransDate"])
ID_Received_Dates[item].append({ref: date})
elif action == "Use":
ID_Use_Totals[item].append(row["Qty"])
编辑:如果 CSV 真的是形式
"Employee", "Stock Location", "Stock Item"
"Ordered", "16", "32142"
stock CSV 模块无法完全解析它。
您可以使用 Pandas 解析文件,然后遍历行,但我不确定这最终是否会更快:
import collections
import datetime
import pandas
ID_Use_Totals = collections.defaultdict(list)
ID_Order_Dates = collections.defaultdict(list)
ID_Received_Dates = collections.defaultdict(list)
ID_Refs = {}
IDs = set(args.ID)
order_actions = {"Order/Resupply", "Cons. Purchase"}
for filename in args.csv:
for index, row in pd.read_csv(filename).iterrows():
item = row["Stock Item"]
if item not in IDs:
continue
ref = row["Ref"]
action = row["Action"]
if action in order_actions:
date = datetime.datetime.fromisoformat(row["TransDate"])
ID_Order_Dates[item].append({ref: date})
elif action == "Received":
date = datetime.datetime.fromisoformat(row["TransDate"])
ID_Received_Dates[item].append({ref: date})
elif action == "Use":
ID_Use_Totals[item].append(row["Qty"])