python - Pandas Dataframe: to_dict() poor performance
Problem description
I work with APIs that return large pandas DataFrames. I'm not aware of a fast way to iterate through a DataFrame directly, so I cast it to a dictionary with to_dict().
After my data is in dictionary form, the performance is fine. However, the to_dict() operation tends to be a performance bottleneck.
I often group columns of the DataFrame together to form a multi-index and use the 'index' orientation for to_dict(). I'm not sure whether the large multi-index drives the poor performance.
Is there a faster way to convert a pandas DataFrame to a dictionary? Or is there a better way to iterate directly over the DataFrame without any conversion? I'm not sure whether vectorization could be applied here.
Below I give sample code which mimics the issue with timings:
import pandas as pd
import random as rd
import time
#Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)
#Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))
#Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))
#Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))
#Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
Thank you!
Solution
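The common guidance is not to iterate at all: apply functions to all rows and columns, or to grouped rows/columns. If you do need to iterate, the third timed section of the code below shows how to loop over the NumPy array exposed by the .values attribute. The results are: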
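Pivot Construction takes: 0.012315988540649414
Dataframe iteration takes: 0.32346272468566895
Iteration over values takes: 0.004369020462036133
Cast to dictionary takes: 0.023524761199951172
Iteration over dictionary takes: 0.0010480880737304688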
# Test data
import pandas as pd
import random as rd
import time
#Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)
#Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))
#Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))
#Iterate over all values in pivot table
t0 = time.time()
v = df_pivot.values
for row in range(df_pivot.shape[0]):
    for column in range(df_pivot.shape[1]):
        test = v[row, column]
t1 = time.time()
print('Iteration over values takes: ' + str(t1-t0))
#Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))
#Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
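To make the "don't iterate" advice concrete, here is a minimal sketch of two alternatives to the per-cell .loc loop: whole-column (vectorized) operations, and building the same 'index'-oriented dictionary from the underlying NumPy array. The seeded sample data, the threshold of 5, and the names column_means, above_five, and fast_dict are assumptions made for illustration; they are not part of the original question. Whether the dictionary construction actually beats to_dict('index') depends on the pandas version and the shape of the data, so it is worth timing on your own frames.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, seeded so repeated runs give the same frame.
rng = np.random.default_rng(0)
columns = ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I']
df = pd.DataFrame(rng.integers(0, 11, size=(1000, len(columns))), columns=columns)
df_pivot = pd.pivot_table(df, values=columns[-3:], index=columns[:-3])

# 1) Vectorized: operate on whole columns instead of visiting each cell.
column_means = df_pivot.mean()            # one mean per value column
above_five = (df_pivot > 5).sum(axis=1)   # per-row count of cells greater than 5

# 2) If a dictionary is still needed, build it from the NumPy array.
#    Keys are the multi-index tuples; each value maps column name -> cell.
value_columns = list(df_pivot.columns)
fast_dict = {
    key: dict(zip(value_columns, row))
    for key, row in zip(df_pivot.index, df_pivot.to_numpy().tolist())
}
```

The first part never touches individual cells from Python, which is usually where the time goes. The second part produces the same nested mapping as to_dict('index') for this frame, so the existing dictionary-based loops can stay unchanged.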