How to flag duplicate rows in a pandas DataFrame

Question

I have this DataFrame and want to add a column that indicates whether a client ID occurs more than once.

client     age-group     category
1       <18           basic
1       <18           premium
2       <18           premium
3       <18           premium
4       18-24           basic
5       18-24           basic
6       <18           basic
5       <18           premium
2       <18           basic
7       <18           basic

Desired output:

client     age-group     category      regular_client
1       <18           basic            yes
1       <18           premium          yes
2       <18           premium          yes
3       <18           premium          no
4       18-24           basic          no
5       18-24           basic          yes
6       <18           basic            no
5       <18           premium          yes
2       <18           basic            yes
7       <18           basic            no

The only way I know is to loop over the rows:

for idx, _ in df.iterrows():

But I'm pretty sure there is a faster and easier way.

Tags: python, pandas, numpy

Answer

Use Series.duplicated + Series.map:

df['regular_client'] = df['client'].duplicated(keep=False).map({True:'yes', False:'no'})
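The keep=False argument matters here. With the default keep='first', duplicated() leaves the first occurrence of each value unflagged, so the first row of a repeated client would be labelled 'no'. A minimal sketch of the difference, using a made-up Series (the values are illustrative, not the question's data):

```python
import pandas as pd

# Illustrative values only; not the DataFrame from the question.
s = pd.Series([1, 1, 2, 3, 2])

# Default keep='first': the first occurrence of each value is not flagged.
first_only = s.duplicated()

# keep=False: every member of a repeated group is flagged.
all_dupes = s.duplicated(keep=False)

print(first_only.tolist())  # [False, True, False, False, True]
print(all_dupes.tolist())   # [True, True, True, False, True]
```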

Or Series.duplicated + np.where (with numpy imported as np):

df['regular_client'] = np.where(df['client'].duplicated(keep=False), 'yes', 'no')

Result:

   client age-group category regular_client
0       1       <18    basic            yes
1       1       <18  premium            yes
2       2       <18  premium            yes
3       3       <18  premium             no
4       4     18-24    basic             no
5       5     18-24    basic            yes
6       6       <18    basic             no
7       5       <18  premium            yes
8       2       <18    basic            yes
9       7       <18    basic             no
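To reproduce the result end to end, the following sketch rebuilds the question's table and applies both variants. The DataFrame construction is an assumption based on the printed table (the question does not show how df was created), and the names is_repeat, via_map and via_where are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuilt from the table in the question; how the original df was
# created is not shown, so this construction is an assumption.
df = pd.DataFrame({
    "client": [1, 1, 2, 3, 4, 5, 6, 5, 2, 7],
    "age-group": ["<18", "<18", "<18", "<18", "18-24",
                  "18-24", "<18", "<18", "<18", "<18"],
    "category": ["basic", "premium", "premium", "premium", "basic",
                 "basic", "basic", "premium", "basic", "basic"],
})

# True for every row whose client value appears more than once.
is_repeat = df["client"].duplicated(keep=False)

# Variant 1: map the booleans to labels (returns a Series).
via_map = is_repeat.map({True: "yes", False: "no"})

# Variant 2: np.where picks a label per element (returns a NumPy array).
via_where = np.where(is_repeat, "yes", "no")

# Both variants produce the same labels.
assert via_map.tolist() == via_where.tolist()

df["regular_client"] = via_map
print(df)
```

Both variants give the same labels. The map version returns a Series that keeps the DataFrame's index, while np.where returns a plain NumPy array that pandas aligns by position on assignment.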
