python - Pandas: return the most frequent value in each group (possibly without apply)

Problem description

Let's assume the following input dataset:

import pandas as pd

test1 = [[0,7,50], [0,3,51], [0,3,45], [1,5,50],[1,0,50],[2,6,50]]
df_test = pd.DataFrame(test1, columns=['A','B','C'])

which corresponds to:

    A   B   C
0   0   7   50
1   0   3   51
2   0   3   45
3   1   5   50
4   1   0   50
5   2   6   50

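I want to obtain a dataset grouped by 'A', together with the most common value of 'B' in each group and the number of times that value occurs: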

A   most_freq freq
0   3          2
1   5          1
2   6          1

I can get the first two columns with:

grouped = df_test.groupby("A")
out_df = pd.DataFrame(index=grouped.groups.keys())
out_df['most_freq'] = df_test.groupby('A')['B'].apply(lambda x: x.value_counts().idxmax())

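But I am having trouble with the last column. Also: is there a faster way that does not involve apply? This solution does not scale to larger inputs (I also tried dask).

Thanks a lot!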
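For reference, the missing freq column can be filled in with the same apply-based pattern, because the largest entry of value_counts() is the count of the most frequent value. This is only a minimal sketch of that idea; it keeps the per-group Python calls and therefore the performance problem described above.

```python
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])

grouped = df_test.groupby('A')['B']

# Each lambda runs once per group, which is what makes this slow on large inputs.
out_df = pd.DataFrame({
    # label of the largest count = most frequent value of B in the group
    'most_freq': grouped.apply(lambda x: x.value_counts().idxmax()),
    # the largest count itself = number of occurrences of that value
    'freq': grouped.apply(lambda x: x.value_counts().max()),
})
print(out_df)
```

When several values of B are tied within a group, which one idxmax reports depends on the order value_counts returns them in.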

Tags: python, pandas, pandas-groupby

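Solution

Use SeriesGroupBy.value_counts, which sorts the counts in descending order within each group by default, then call Series.reset_index and DataFrame.drop_duplicates to keep only the top value for each group: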

df = (df_test.groupby('A')['B']
             .value_counts()
             .rename_axis(['A','most_freq'])
             .reset_index(name='freq')
             .drop_duplicates('A'))
print(df)
   A  most_freq  freq
0  0          3     2
2  1          0     1
4  2          6     1
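Note that group 1 is a tie (5 and 0 each occur once), so the output above reports 0 where the expected table in the question shows 5. If the choice between tied values matters, one option is to make the ordering explicit instead of relying on the order value_counts returns. The following is a sketch of that variant, not part of the original answer; the tie-breaking rule (smallest B wins) is an assumption chosen for illustration.

```python
import pandas as pd

test1 = [[0, 7, 50], [0, 3, 51], [0, 3, 45], [1, 5, 50], [1, 0, 50], [2, 6, 50]]
df_test = pd.DataFrame(test1, columns=['A', 'B', 'C'])

# Count every (A, B) pair without apply.
counts = (df_test.groupby(['A', 'B'])
                 .size()
                 .reset_index(name='freq'))

# Within each A: highest freq first, ties broken by the smallest B (assumed rule).
out = (counts.sort_values(['A', 'freq', 'B'], ascending=[True, False, True])
             .drop_duplicates('A')              # keep the first row per group
             .rename(columns={'B': 'most_freq'})
             .reset_index(drop=True))
print(out)
```

Changing the last entry of the ascending list to False would prefer the largest B among tied values instead.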
