Top 3 most frequent values using groupby

Problem description

I have the following DataFrame:

Client_id   Product_id    Product_description     quantity
001           1353            orange                10
001           1353            orange                10
001           1354            lime                  5
001           1200            pen                   1

004           1354            orange                10
...

I would like to get a DataFrame that reports the top 3 best-selling products for each client (so, with shape (n_customers x 4)):

Client_id   product_id_1    product_description_1   product_id_2    product_description_2    product_id_3    product_description_3 
001               1353            orange                 1354                lime               1200                   pen
...

How can I create this kind of DataFrame?

Tags: python, pandas, group-by

Solution


Try this:

import pandas as pd

# The code below uses lowercase column names, so normalize them first
df = df.rename(columns=str.lower)

# Sum quantities so that each (client, product) pair appears only once
tmp = df.groupby(["client_id", "product_id", "product_description"], as_index=False
                         ).agg({"quantity":"sum"})


# Sort the aggregated totals and keep the top 3 products per client
tmp = tmp.sort_values(['client_id', 'quantity'], ascending=[True, False]).groupby('client_id').head(3).copy()
tmp['order'] = (tmp.groupby('client_id').cumcount() + 1)
tmp = tmp.set_index(['client_id', 'order']).unstack()


# Rename the columns to match your desired format
cols1 = pd.MultiIndex.from_product([[1,2,3], ['product_id', 'product_description']]).swaplevel()
cols2 = cols1.get_level_values(0) + '_' + cols1.get_level_values(1).astype('str')
result = tmp[cols1].set_axis(cols2, axis=1)
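
To check the approach on the sample rows from the question, here is a minimal self-contained sketch. It assumes lowercase column names and string client ids, and the helper name `top_n_products` is hypothetical (it is not part of pandas or of the original answer).

```python
import pandas as pd

# Sample rows from the question (column names lowercased here as an assumption).
df = pd.DataFrame({
    "client_id": ["001", "001", "001", "001", "004"],
    "product_id": [1353, 1353, 1354, 1200, 1354],
    "product_description": ["orange", "orange", "lime", "pen", "orange"],
    "quantity": [10, 10, 5, 1, 10],
})


def top_n_products(df, n=3):
    """Hypothetical helper: one row per client, top-n products by total quantity."""
    # Total quantity per (client, product) pair.
    totals = df.groupby(
        ["client_id", "product_id", "product_description"], as_index=False
    )["quantity"].sum()
    # Best sellers first within each client, then keep the first n rows.
    top = (
        totals.sort_values(["client_id", "quantity"], ascending=[True, False])
        .groupby("client_id")
        .head(n)
        .copy()
    )
    # Rank 1..n within each client, used as the wide-format column suffix.
    top["order"] = top.groupby("client_id").cumcount() + 1
    wide = top.set_index(["client_id", "order"])[
        ["product_id", "product_description"]
    ].unstack()
    # Interleave as id_1, description_1, id_2, ... and flatten the MultiIndex.
    ordered = [
        (name, rank)
        for rank in range(1, n + 1)
        for name in ("product_id", "product_description")
        if (name, rank) in wide.columns
    ]
    wide = wide[ordered]
    wide.columns = [f"{name}_{rank}" for name, rank in ordered]
    return wide.reset_index()


result = top_n_products(df)
print(result)
```

Clients with fewer than three distinct products (such as client 004 in the sample) get NaN in the remaining slots, and `product_id` columns are converted to floats as a side effect of holding NaN.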
