Fast implementation of max value per user pandas

Question

Following is a piece of code I'm using. For each user it keeps a single row, chosen according to a sorting scheme. The problem is that it runs too slowly for my needs; I was wondering whether it can be implemented faster:

import pandas as pd

df1 = pd.DataFrame({'user': ['a', 'b', 'c', 'd'],
                   'user_info': [1, 3, 5, 6]},
                   columns=['user', 'user_info'])

df2 = pd.DataFrame({'user': ['a', 'b', 'f', 'h'],
                   'user_info': [3, 5, 5, 6]},
                   columns=['user', 'user_info'])


data_frames_dict_with_importance_score = {2: df2,
                                          1: df1}


def apply_importance(df, importance):
    df['tag_max'] = importance
    return df


join_list = ['user', 'user_info']

final_recommendations = pd.concat([apply_importance(df[join_list], importance)
                                   for importance, df in data_frames_dict_with_importance_score.items()])

final_recommendations = final_recommendations.sort_values(['user', 'tag_max'], ascending=False).groupby(
    ['user'], as_index=False).head(1)
final_recommendations.reset_index(inplace=True)

Any help on that one would be awesome!

Tags: pandas

Solution


You can assign the tag_max in a generator expression, then concat with sort_values followed by drop_duplicates:

out = (pd.concat(v.assign(tag_max=k)
                 for k, v in data_frames_dict_with_importance_score.items())
         .sort_values(['user', 'tag_max'], ascending=False)
         .drop_duplicates('user'))
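Run end-to-end with the sample frames from the question, that approach looks like this (a self-contained sketch for reference):

```python
import pandas as pd

# sample frames from the question
df1 = pd.DataFrame({'user': ['a', 'b', 'c', 'd'],
                    'user_info': [1, 3, 5, 6]})
df2 = pd.DataFrame({'user': ['a', 'b', 'f', 'h'],
                    'user_info': [3, 5, 5, 6]})

data_frames_dict_with_importance_score = {2: df2, 1: df1}

# tag each frame with its importance score, stack them, then keep the
# first (highest tag_max) row per user after a descending sort
out = (pd.concat(v.assign(tag_max=k)
                 for k, v in data_frames_dict_with_importance_score.items())
         .sort_values(['user', 'tag_max'], ascending=False)
         .drop_duplicates('user'))

print(out)  # one row per user: h, f, d, c, b, a
```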

Or:

out = (pd.concat(data_frames_dict_with_importance_score,
                 names=['tag_max', 'Index'])
         .reset_index()
         .sort_values(['user', 'tag_max'], ascending=False)
         .drop_duplicates('user'))

  user  user_info  tag_max
3    h          6        2
2    f          5        2
3    d          6        1
2    c          5        1
1    b          5        2
0    a          3        2
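A variation on the same idea, not from the original answer: skip the descending sort and instead use `groupby(...).idxmax()` to locate the row with the highest `tag_max` per user. This needs `ignore_index=True` in the `concat` so row labels are unique, and it returns users in ascending order:

```python
import pandas as pd

df1 = pd.DataFrame({'user': ['a', 'b', 'c', 'd'],
                    'user_info': [1, 3, 5, 6]})
df2 = pd.DataFrame({'user': ['a', 'b', 'f', 'h'],
                    'user_info': [3, 5, 5, 6]})

frames = {2: df2, 1: df1}

# stack with a fresh unique index so idxmax labels are unambiguous
stacked = pd.concat((v.assign(tag_max=k) for k, v in frames.items()),
                    ignore_index=True)

# for each user, the index label of the row with the highest tag_max
best = stacked.groupby('user')['tag_max'].idxmax()

out = stacked.loc[best].reset_index(drop=True)
```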
