python - How to select the top N within each group and aggregate the rest into "Others" in Pandas?

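Problem description

I have data containing products, prices, categories, and counties. I use this code to count the number of products per category in each county: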

df_count = df.groupby(['County','Category']).size().reset_index(name='counts')

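My dataframe now looks like this:

	County	Category	counts
0	Blekinge	Accessories & watches	35
1	Blekinge	Audio & video	101
2	Blekinge	Bicycles	78
3	Blekinge	Boat parts & accessories	65
4	Blekinge	Boats	143
...	...	...	...
657	Östergötland	Snowmobile parts & accessories	2
658	Östergötland	Snowmobiles	5
659	Östergötland	Sports & leisure equipment	335
660	Östergötland	Tools	102
661	Östergötland	Trucks & construction	66

662 rows × 3 columns

There are 21 counties and 32 categories. counts is the number of products in a category within a county. Not every category appears in every county.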

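I want a new dataframe that contains the top N (for example, 2) largest categories for each county, with the rest aggregated into "Others". I want this for every county, so that it looks like this:

County	Category	counts
Blekinge	Boats	143
Blekinge	Audio & video	101
Blekinge	Others	178
...	...	...
Östergötland	Sports & leisure equipment	335
Östergötland	Tools	102
Östergötland	Others	175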

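I have looked at earlier posts that do something similar:

How to group the "remaining" results beyond Top N into "Others" in Pandas

Sort the top N and group "Others" in a pandas df

and tried this: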

# group by & sort descending
df_sorted=df_count.groupby(['County','Category']).sum().sort_values('counts', ascending=False).reset_index()

# rename rows other than top-n to 'Others'
x_sorted.groupby('County').loc[x_sorted.index >=3, 'Category'] = 'Others'

and this:

df_count.sort_values(by=['counts'], ascending=False).groupby('County').head(2).sort_values(by=['County']).reset_index(drop=True)

#not_top2 = df.groupby('Version').sum().sort('Value', ascending=False).index[2:]
not_top2 = x.groupby(['County','Category']).sum().sort_values('counts', ascending=False).index[2:]

dfnew = x.replace(not_top2, 'Other')

dfnew.groupby(['County','Category']).sum()

But I have not managed to get the desired output.

Any help or guidance is much appreciated!

Tags: python, pandas, pandas-groupby

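Solution

You can use the following sequence of steps to get the final output; I think it is fairly straightforward.

To make it easy to follow, I have added comments in the code and shown the output of each step.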

# Grab the top 2 largest categories of each county
# (df here is the counts dataframe, i.e. df_count in the question)
top_two = df.groupby('County').apply(lambda x: x.nlargest(2, 'counts')).reset_index(drop=True)

>>> top_two
         County                    Category  counts
0      Blekinge                       Boats     143
1      Blekinge               Audio & video     101
2  Östergötland  Sports & leisure equipment     335
3  Östergötland                       Tools     102

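# Create a dataframe with the rest of the rows (everything not in top_two).
# DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead.
df_others = pd.concat([df, df.merge(top_two, 'inner')]).drop_duplicates(keep=False)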

>>> df_others
         County                        Category  counts
0      Blekinge           Accessories & watches      35
2      Blekinge                        Bicycles      78
3      Blekinge        Boat parts & accessories      65
5  Östergötland  Snowmobile parts & accessories       2
6  Östergötland                     Snowmobiles       5
9  Östergötland           Trucks & construction      66
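
The `drop_duplicates(keep=False)` step above is effectively an anti-join: it keeps the rows of `df` that do not appear in `top_two`. As a hedged alternative, the same rows can be selected with the `indicator=True` option of `merge`. This is only a sketch: the sample frame contains just the ten rows visible in the question, `flagged` and `rest` are illustrative names, and `top_two` is built here with `sort_values` plus `groupby().head(2)` rather than `apply`.

```python
import pandas as pd

# Sample counts copied from the rows visible in the question (not the full data).
df = pd.DataFrame({
    "County": ["Blekinge"] * 5 + ["Östergötland"] * 5,
    "Category": [
        "Accessories & watches", "Audio & video", "Bicycles",
        "Boat parts & accessories", "Boats",
        "Snowmobile parts & accessories", "Snowmobiles",
        "Sports & leisure equipment", "Tools", "Trucks & construction",
    ],
    "counts": [35, 101, 78, 65, 143, 2, 5, 335, 102, 66],
})

# Top 2 rows per county, selected without groupby.apply.
top_two = df.sort_values("counts", ascending=False).groupby("County").head(2)

# A left merge with indicator=True adds a "_merge" column; rows found only
# in df (and not in top_two) are flagged as "left_only".
flagged = df.merge(
    top_two, on=["County", "Category", "counts"], how="left", indicator=True
)
rest = flagged.loc[
    flagged["_merge"] == "left_only", ["County", "Category", "counts"]
]
print(rest)
```

Unlike the duplicate-dropping trick, this does not discard rows of `df` that happen to be exact duplicates of each other, which may matter if the input is not already one row per county and category.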

# Group by county, sum the counts, and assign 'Others' as the Category in df_others
df_others = df_others.groupby('County')['counts'].sum().reset_index()
df_others['Category'] = 'Others'

>>> df_others
         County  counts Category
0      Blekinge     178   Others
1  Östergötland      73   Others

Finally, concat() the two dataframes to get the final output:

res = pd.concat([top_two,df_others]).sort_values('County').reset_index(drop=True)
>>> res
         County                    Category  counts
0      Blekinge                       Boats     143
1      Blekinge               Audio & video     101
2      Blekinge                      Others     178
3  Östergötland  Sports & leisure equipment     335
4  Östergötland                       Tools     102
5  Östergötland                      Others      73
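
For reference, the steps above can be folded into one helper that works for any N. This is a minimal sketch, not the only way to do it: the function name `top_n_with_others` and its parameters are made up for illustration, the sample frame holds only the rows visible in the question, and the ranking uses `groupby().cumcount()` instead of `apply`.

```python
import pandas as pd

def top_n_with_others(counts_df, n=2, group_col="County",
                      label_col="Category", value_col="counts",
                      other_label="Others"):
    """Keep the n largest rows per group and sum the remainder into one row."""
    # Sort by group, then by value descending, so the largest rows come first.
    ordered = counts_df.sort_values([group_col, value_col],
                                    ascending=[True, False])
    # Position of each row inside its group: 0 is the largest, 1 the next, ...
    rank = ordered.groupby(group_col).cumcount()
    top = ordered[rank < n]
    # Everything beyond the first n rows is summed per group.
    rest = (ordered[rank >= n]
            .groupby(group_col, as_index=False)[value_col].sum())
    rest[label_col] = other_label
    # Put the "Others" rows after the top rows; a stable sort on the group
    # column keeps that order within each group.
    out = pd.concat([top, rest], ignore_index=True)
    out = out.sort_values(group_col, kind="stable").reset_index(drop=True)
    return out[[group_col, label_col, value_col]]

# Sample counts copied from the rows visible in the question (not the full data).
df = pd.DataFrame({
    "County": ["Blekinge"] * 5 + ["Östergötland"] * 5,
    "Category": [
        "Accessories & watches", "Audio & video", "Bicycles",
        "Boat parts & accessories", "Boats",
        "Snowmobile parts & accessories", "Snowmobiles",
        "Sports & leisure equipment", "Tools", "Trucks & construction",
    ],
    "counts": [35, 101, 78, 65, 143, 2, 5, 335, 102, 66],
})

res = top_n_with_others(df, n=2)
print(res)
```

A county with N or fewer categories simply gets no "Others" row in this sketch, since there is nothing left to sum.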

Let me know if anything is unclear.
