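python - How to flag duplicate rows in a pandas DataFrame
Problem description
I have this DataFrame and want to add a column that indicates whether a client ID (the client column) occurs more than once.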
client age-group category
1 <18 basic
1 <18 premium
2 <18 premium
3 <18 premium
4 18-24 basic
5 18-24 basic
6 <18 basic
5 <18 premium
2 <18 basic
7 <18 basic
Desired result:
client age-group category regular_client
1 <18 basic yes
1 <18 premium yes
2 <18 premium yes
3 <18 premium no
4 18-24 basic no
5 18-24 basic yes
6 <18 basic no
5 <18 premium yes
2 <18 basic yes
7 <18 basic no
The only way I know is to loop over the rows:
for idx, _ in df.iterrows():
But I'm pretty sure there is a faster and easier way.
Solution
Use Series.duplicated + Series.map:
df['regular_client'] = df['client'].duplicated(keep=False).map({True:'yes', False:'no'})
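The keep=False argument matters here. With the default keep='first', duplicated flags only the repeats that come after the first occurrence, so the first row of each repeated client would end up labeled 'no'. A minimal sketch of the difference, using a made-up Series (the values are illustrative and are not the question's data):

```python
import pandas as pd

# Illustrative values only: 1 and 2 each appear twice, 3 appears once.
s = pd.Series([1, 1, 2, 3, 2])

# Default keep='first': the first occurrence of each value is NOT flagged.
first = s.duplicated()

# keep=False: every member of a repeated group is flagged.
every = s.duplicated(keep=False)

print(first.tolist())  # [False, True, False, False, True]
print(every.tolist())  # [True, True, True, False, True]
```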
Or Series.duplicated + np.where (this needs import numpy as np):
df['regular_client'] = np.where(df['client'].duplicated(keep=False), 'yes', 'no')
Result:
client age-group category regular_client
0 1 <18 basic yes
1 1 <18 premium yes
2 2 <18 premium yes
3 3 <18 premium no
4 4 18-24 basic no
5 5 18-24 basic yes
6 6 <18 basic no
7 5 <18 premium yes
8 2 <18 basic yes
9 7 <18 basic no
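To reproduce this end to end, the following sketch rebuilds the question's data by hand and checks that both variants produce the same labels. The variable names is_repeat, via_map and via_where are only illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data by hand.
df = pd.DataFrame({
    "client": [1, 1, 2, 3, 4, 5, 6, 5, 2, 7],
    "age-group": ["<18", "<18", "<18", "<18", "18-24",
                  "18-24", "<18", "<18", "<18", "<18"],
    "category": ["basic", "premium", "premium", "premium", "basic",
                 "basic", "basic", "premium", "basic", "basic"],
})

# True for every row whose client value occurs more than once.
is_repeat = df["client"].duplicated(keep=False)

# Variant 1: map the booleans to labels.
via_map = is_repeat.map({True: "yes", False: "no"})

# Variant 2: pick a label per element with np.where.
via_where = np.where(is_repeat, "yes", "no")

# Both variants should agree row by row.
assert via_map.tolist() == via_where.tolist()

df["regular_client"] = via_map
print(df)
```

Both variants are vectorized, so neither needs the row-by-row iterrows loop from the question.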