Counting duplicates grouped by ID in pandas

Question

I'm not sure whether this is a duplicate question, but here it goes.

Suppose I have the following table:

import pandas as pd

lst = [1, 1, 1, 2, 2, 3, 3, 4, 5]
lst2 = ['A', 'A', 'B', 'D', 'E', 'A', 'A', 'A', 'E']

# Build a two-column frame from the paired lists
df = pd.DataFrame(list(zip(lst, lst2)),
                  columns=['ID', 'val'])

This produces the following table:

+----+-----+
| ID | Val |
+----+-----+
| 1  | A   |
+----+-----+
| 1  | A   |
+----+-----+
| 1  | B   |
+----+-----+
| 2  | D   |
+----+-----+
| 2  | E   |
+----+-----+
| 3  | A   |
+----+-----+
| 3  | A   |
+----+-----+
| 4  | A   |
+----+-----+
| 5  | E   |
+----+-----+

The goal is to flag duplicated values of val within each ID group:

+----+-----+--------------+
| ID | Val | is_duplicate |
+----+-----+--------------+
| 1  | A   | 1            |
+----+-----+--------------+
| 1  | A   | 1            |
+----+-----+--------------+
| 1  | B   | 0            |
+----+-----+--------------+
| 2  | D   | 0            |
+----+-----+--------------+
| 2  | E   | 0            |
+----+-----+--------------+
| 3  | A   | 1            |
+----+-----+--------------+
| 3  | A   | 1            |
+----+-----+--------------+
| 4  | A   | 0            |
+----+-----+--------------+
| 5  | E   | 0            |
+----+-----+--------------+

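I tried the following code, but it counts occurrences across the whole table rather than within each ID:

 df_grouped = df.groupby(['val']).size().reset_index(name='count')

The code below, on the other hand, only flags duplicates of val across the whole column, ignoring ID:

 df.duplicated(subset=['val'])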

What is the best way to do this?

Tags: python, pandas

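Answer

Let's try duplicated, passing both columns as the subset and keep=False so that every row of a repeated (ID, val) pair is marked: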

df['is_dup'] = df.duplicated(subset=['ID', 'val'], keep=False).astype(int)
df
Out[21]: 
   ID val  is_dup
0   1   A       1
1   1   A       1
2   1   B       0
3   2   D       0
4   2   E       0
5   3   A       1
6   3   A       1
7   4   A       0
8   5   E       0
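
For anyone who wants to check this end to end, here is a small self-contained sketch using the sample data from the question. It compares the duplicated call above with a groupby-based variant, and also shows what the default keep='first' would give instead. The variable names (by_duplicated, by_groupby, first_kept) are mine, not from the original post, and the groupby version is just one possible alternative, not necessarily the preferred one.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'ID':  [1, 1, 1, 2, 2, 3, 3, 4, 5],
    'val': ['A', 'A', 'B', 'D', 'E', 'A', 'A', 'A', 'E'],
})

# Approach from the answer: keep=False marks every row whose
# (ID, val) pair occurs more than once, including the first one.
by_duplicated = df.duplicated(subset=['ID', 'val'], keep=False).astype(int)

# Alternative sketch: compute the size of each (ID, val) group,
# broadcast it back to the rows, and flag groups larger than one.
group_size = df.groupby(['ID', 'val'])['val'].transform('size')
by_groupby = (group_size > 1).astype(int)

# For contrast: the default keep='first' leaves the first
# occurrence of each repeated pair unmarked.
first_kept = df.duplicated(subset=['ID', 'val']).astype(int)

print(by_duplicated.tolist())
print(by_groupby.tolist())
print(first_kept.tolist())
```

The groupby variant is handy if you also want the actual count per (ID, val) pair, since group_size already holds it; if you only need the 0/1 flag, duplicated is shorter.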
