首页 > 解决方案 > str.split(',', expand = True) 太……慢了,如何提升性能?

问题描述

我有以下数据框:

SFDC,SID,MID,ACT
DC02,SID1,GOAL,"view_goal_list"
DC02,SID1,GOAL,"view_goal_card,expand_people_selector_panel"
DC02,SID1,GOAL,"view_goal_list,select_user,click_add_activity"

我想将 ACT 列转换为以下格式:

SFDC,SID,MID,step1,step2,step3
DC02,SID1,GOAL,view_goal_list,na,na DC02,SID1,GOAL,view_goal_card,expand_people_selector_panel,na DC02,SID1,GOAL,view_goal_list,select_user,click_add_activity

这是我使用的代码,在功能上它可以工作,但是在处理大约 5000k 记录时性能太差了(需要几个小时。)

df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")

任何专家都可以帮助提供快速性能的解决方案吗?

标签: pandasdataframeperformancesplit

解决方案


或许你可以把它降下来一点...

import pandas as pd

df = pd.read_csv('split.txt')

# 'split.txt' is the example data given in the question copied over and over
print(df.shape)
print(df.head())

(50000, 4)
   SFDC   SID   MID                                            ACT
0  DC02  SID1  GOAL                                 view_goal_list
1  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel
2  DC02  SID1  GOAL  view_goal_list,select_user,click_add_activity
3  DC02  SID1  GOAL                                 view_goal_list
4  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel

我的时间:

[172 毫秒] 当前方法:

%%timeit

df = pd.read_csv('split.txt')

df = df.set_index(['SFDC','SID', 'MID'])['ACT'].astype(str).str.split(',', expand = True).rename(columns=lambda x: f"step{x+1}")

df = df.reset_index()

# 172 ms ± 2.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[152 毫秒] 分离、拆分和连接(更快一点):

%%timeit

df = pd.read_csv('split.txt')

s = df['ACT'].str.split(',', expand=True)

s = s.add_prefix('step_')

df = df.join(s)

# 152 ms ± 1.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[93 ms] Apply 速度很快,因为它进出函数的速度更快:

%%timeit

df = pd.read_csv('split.txt')

def splitCol(s):
    return s.split(',')

s = df['ACT'].apply(splitCol).to_list()

s = pd.DataFrame(s)

s = s.add_prefix('step_')
# if required comment out the above line and instead rename columns 0,1,2,3 etc. to step_1, step_2, etc. rather than zero
#s.columns = ['step_' + str(col+1) for col in s.columns]

df = df.join(s)

# 93 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

[90.3 ms] 直接 str.split().tolist() 并加入

似乎是最快的(给定 ± 3.64 ms)。稍微歪斜,因为这个代码块s.columns = ['step_' + str(col+1) for col in s.columns]s = s.add_prefix('step_')

%%timeit

df = pd.read_csv('split.txt')

def splitCol(x):
    return pd.Series(x.split(','))

s = pd.DataFrame()

s = df['ACT'].str.split(',').to_list()

s = pd.DataFrame(s)

# this seems quicker than s = s.add_prefix('step_')
s.columns = ['step_' + str(col+1) for col in s.columns]

df = df.join(s)

# 90.3 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

示例输出:

print(df.head())

   SFDC   SID   MID                                            ACT  \
0  DC02  SID1  GOAL                                 view_goal_list   
1  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel   
2  DC02  SID1  GOAL  view_goal_list,select_user,click_add_activity   
3  DC02  SID1  GOAL                                 view_goal_list   
4  DC02  SID1  GOAL    view_goal_card,expand_people_selector_panel   

           step_0                        step_1              step_2  
0  view_goal_list                          None                None  
1  view_goal_card  expand_people_selector_panel                None  
2  view_goal_list                   select_user  click_add_activity  
3  view_goal_list                          None                None  
4  view_goal_card  expand_people_selector_panel                None 

如果您需要新列开始step_1step_0不是:

s = s.add_prefix('step_')

采用:

# rename columns 0,1,2,3 etc. to step_1, step_2, etc.
s.columns = ['step_' + str(col+1) for col in s.columns]

推荐阅读