A better alternative to a for loop over a large DataFrame with 10 million rows?

Question

I wrote the code below, which works correctly, but I need to reduce its runtime.

for i in range(len(df)):
    try:
        if df['event_name'][i] in ['add_basket_click','remove_basket_click'] and df['event_name'][i-1]=='product_search':
            try:
                if df['event_desc'][i]['firebase_screen_id']==df['event_desc'][i-1]['firebase_screen_id']:
                    df.at[i,'search_process']=1
            except:
                pass
    except:
        pass

Here is a sample dataset:

user_id event_name  event_desc
10  product_search  {'firebase_previous_id': '8996730796507124997'}
10  add_basket_click    {'firebase_previous_id': '8996730796507124997'}
10  start   {'firebase_previous_id': '8996730796507124997'}
10  add_basket_click    {'firebase_previous_id': '8996730796507124997'}

Expected output:

user_id event_name  event_desc  search_process
10  product_search  {'firebase_previous_id': '8996730796507124997'} 0
10  add_basket_click    {'firebase_previous_id': '8996730796507124997'} 1
10  start   {'firebase_previous_id': '8996730796507124997'} 0
10  add_basket_click    {'firebase_previous_id': '8996730796507124997'} 0

Tags: python, pandas

Solution


I believe you need to test the key firebase_previous_id (firebase_screen_id in your original loop) inside the dictionaries of the event_desc column:

m1 = df['event_name'].shift() =='product_search'
m2 = df['event_name'].isin(['add_basket_click','remove_basket_click'])
# the two get() calls use different defaults so that rows missing the key never compare equal
s1 = df['event_desc'].apply(lambda x: x.get('firebase_previous_id', 'not_m'))
s2 = df['event_desc'].apply(lambda x: x.get('firebase_previous_id', 'not_matched'))
m3 = s1 == s2.shift()

df['search_process'] = (m1 & m2 & m3).astype(int)
print(df)
   user_id        event_name                                       event_desc  \
0       10    product_search  {'firebase_previous_id': '8996730796507124997'}   
1       10  add_basket_click  {'firebase_previous_id': '8996730796507124997'}   
2       10             start  {'firebase_previous_id': '8996730796507124997'}   
3       10  add_basket_click  {'firebase_previous_id': '8996730796507124997'}   

   search_process  
0               0  
1               1  
2               0  
3               0  
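
To make the approach above easy to try, here is a minimal self-contained sketch that rebuilds the sample data from the question and wraps the three masks in a function. The function name flag_search_process, its key parameter, and the sentinel strings are hypothetical names chosen for illustration; the sketch also assumes every value in event_desc is a dict.

```python
import pandas as pd


def flag_search_process(df: pd.DataFrame, key: str = "firebase_previous_id") -> pd.Series:
    """Return 1 where a basket click directly follows a product_search
    row whose event_desc holds the same value under `key`, else 0."""
    # Previous row's event is a search (the first row compares NaN -> False).
    prev_is_search = df["event_name"].shift() == "product_search"
    # Current row is one of the two basket events.
    is_basket_click = df["event_name"].isin(["add_basket_click", "remove_basket_click"])
    # Different sentinels on each side, so two rows that both lack the key
    # never compare equal (the original loop skipped such rows as well).
    current_id = df["event_desc"].apply(lambda d: d.get(key, "missing_current"))
    previous_id = df["event_desc"].apply(lambda d: d.get(key, "missing_previous")).shift()
    same_id = current_id == previous_id
    return (prev_is_search & is_basket_click & same_id).astype(int)


# Sample rows copied from the question.
desc = {"firebase_previous_id": "8996730796507124997"}
df = pd.DataFrame({
    "user_id": [10, 10, 10, 10],
    "event_name": ["product_search", "add_basket_click", "start", "add_basket_click"],
    "event_desc": [dict(desc) for _ in range(4)],
})

df["search_process"] = flag_search_process(df)
print(df["search_process"].tolist())  # [0, 1, 0, 0]
```

If the real data stores the identifier under firebase_screen_id, as in the original loop, pass key="firebase_screen_id". Note that shift() compares each row with the row directly above it regardless of user_id, which matches the behaviour of the original loop.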
