pandas - str.split(',', expand = True) 太……慢了,如何提升性能?
问题描述
我有以下数据框:
SFDC,SID,MID,ACT
DC02,SID1,GOAL,"view_goal_list"
DC02,SID1,GOAL,"view_goal_card,expand_people_selector_panel"
DC02,SID1,GOAL,"view_goal_list,select_user,click_add_activity"
我想将 ACT 列转换为以下格式:
SFDC,SID,MID,step1,step2,step3
DC02,SID1,GOAL,view_goal_list,na,na DC02,SID1,GOAL,view_goal_card,expand_people_selector_panel,na DC02,SID1,GOAL,view_goal_list,select_user,click_add_activity
这是我使用的代码,在功能上它可以工作,但是在处理大约 5000k 记录时性能太差了(需要几个小时。)
df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")
任何专家都可以帮助提供快速性能的解决方案吗?
解决方案
或许你可以把它降下来一点...
import pandas as pd
df = pd.read_csv('split.txt')
# 'split.txt' is the example data given in the question copied over and over
print(df.shape)
print(df.head())
(50000, 4)
SFDC SID MID ACT
0 DC02 SID1 GOAL view_goal_list
1 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
2 DC02 SID1 GOAL view_goal_list,select_user,click_add_activity
3 DC02 SID1 GOAL view_goal_list
4 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
我的时间:
[172 毫秒] 当前方法:
%%timeit
df = pd.read_csv('split.txt')
df = df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")
df = df.reset_index()
# 172 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[152 毫秒] 分离、拆分和连接(更快一点):
%%timeit
df = pd.read_csv('split.txt')
s = df['ACT'].str.split(',', expand=True)
s = s.add_prefix('step_')
df = df.join(s)
# 152 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[93 ms] Apply 速度很快,因为它进出函数的速度更快:
%%timeit
df = pd.read_csv('split.txt')
def splitCol(s):
return s.split(',')
s = df['ACT'].apply(splitCol).to_list()
s = pd.DataFrame(s)
s = s.add_prefix('step_')
# if required comment out the above line and instead rename columns 0,1,2,3 etc. to step_1, step_2, etc. rather than zero
#s.columns = ['step_' + str(col+1) for col in s.columns]
df = df.join(s)
# 93 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[90.3 ms] 直接 str.split().tolist() 并加入
似乎是最快的(给定 ± 3.64 ms)。稍微歪斜,因为这个代码块s.columns = ['step_' + str(col+1) for col in s.columns]
比s = s.add_prefix('step_')
%%timeit
df = pd.read_csv('split.txt')
def splitCol(x):
return pd.Series(x.split(','))
s = pd.DataFrame()
s = df['ACT'].str.split(',').to_list()
s = pd.DataFrame(s)
# this seems quicker than s = s.add_prefix('step_')
s.columns = ['step_' + str(col+1) for col in s.columns]
df = df.join(s)
# 90.3 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
示例输出:
print(df.head())
SFDC SID MID ACT \
0 DC02 SID1 GOAL view_goal_list
1 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
2 DC02 SID1 GOAL view_goal_list,select_user,click_add_activity
3 DC02 SID1 GOAL view_goal_list
4 DC02 SID1 GOAL view_goal_card,expand_people_selector_panel
step_0 step_1 step_2
0 view_goal_list None None
1 view_goal_card expand_people_selector_panel None
2 view_goal_list select_user click_add_activity
3 view_goal_list None None
4 view_goal_card expand_people_selector_panel None
如果您需要新列开始step_1
而step_0
不是:
s = s.add_prefix('step_')
采用:
# rename columns 0,1,2,3 etc. to step_1, step_2, etc.
s.columns = ['step_' + str(col+1) for col in s.columns]
推荐阅读
- javascript - TamperMonkey:无法读取 null 的属性“dispatchEvent”
- c# - NUnit - Equal(this, that), True(this == that) 或 That (this, Is.EqualTo(that))?
- ios - FaceBook SDK - 根本无法记录应用事件?
- kubernetes - 用于 google(和 facebook)识别的 Kubernetes 环境变量
- ms-access - 当我在字符串字段中输入条目时字符消失
- kotlin - 在 Kotlin 中序列化非数据类
- reactjs - 无法从 Minikube 访问 React 应用程序
- django - 在管理面板中输入数据时如何在models.py中设置特定条件?
- python - 我有两个方程 f(x,y,k)=0 和 g(x,y,k)=0。我想在 Python 的 x,y 平面上绘制相交曲线
- c# - 如何限制连接的参数数量?