How to show duplicated IDs and duplicated data in Python pandas

Problem description

I have a DataFrame like the one shown below.

import pandas as pd

k = {'ID': [1, 2, 3, 4, 5, 6],
     'Name': ['John Danny', 'Micheal K', 'John Danny', 'jerred', 'John Danny', 'joe'],
     'phone': ['1111', '2222', '2233', '1111', '2222', '6666']}
df = pd.DataFrame(data=k)
df
    ID  Name        phone
    1   John Danny  1111
    2   Micheal K   2222
    3   John Danny  2233
    4   jerred      1111
    5   John Danny  2222
    6   joe         6666

I need to find the duplicated names and phone numbers in the DataFrame, so I used the code below.

df[df['Name'].duplicated(keep=False)].sort_values("Name")

Duplicates by name


ID  Name       phone
1   John Danny  1111
3   John Danny  2233
5   John Danny  2222

Duplicates by phone

    ID  Name       phone
    1   John Danny  1111
    4   jerred      1111
    2   Micheal K   2222
    5   John Danny  2222
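
The question only shows the call for `Name`; the phone table was presumably produced by the same call on `phone`. A minimal sketch under that assumption, using the sample data from the question (the variable names `dup_phone`, `dup_phone_ids` and `later_only_ids` are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Name': ['John Danny', 'Micheal K', 'John Danny', 'jerred', 'John Danny', 'joe'],
    'phone': ['1111', '2222', '2233', '1111', '2222', '6666'],
})

# keep=False flags every row whose phone occurs more than once,
# including the first occurrence.
dup_phone = df[df['phone'].duplicated(keep=False)].sort_values('phone')
dup_phone_ids = dup_phone['ID'].tolist()

# The default keep='first' leaves the first occurrence unflagged,
# so only the later repeats are returned.
later_only_ids = df.loc[df['phone'].duplicated(), 'ID'].tolist()

print(dup_phone)
print(later_only_ids)
```

`keep=False` is what makes both members of each pair show up; with the default, IDs 1 and 2 would be missing from the table.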

But the result I want looks like this:

ID  Name        phone  duplicated of name ids  duplicated of phone ids  Duplicate_name  Duplicate_phone
1   John Danny  1111   3,5                     4                        Yes             Yes
2   Micheal K   2222                           5                        No              Yes
3   John Danny  2233   1,5                                              Yes             No
4   jerred      1111                           1                        No              Yes
5   John Danny  2222   1,3                     2                        Yes             Yes

I was able to compute Duplicate_name and Duplicate_phone with the code below.

df['Duplicate_name'] = df['Name'].duplicated(keep=False).map({True:'Yes', False:'No'})
df['Duplicate_phone'] = df['phone'].duplicated(keep=False).map({True:'Yes', False:'No'})

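The problem is that I cannot fill the "duplicated of name ids" and "duplicated of phone ids" columns with the matching IDs. How can I produce the result table shown above?

Tags: python, pandas, loops, if-statement, duplicates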

Solution


Use GroupBy.transform with a custom function that subtracts each row's own ID from the set of IDs in its group:

def f(x):
    return [', '.join(set(x) - set([y])) for y in x]

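Or filter with an if clause in a generator expression, which keeps the IDs in their original order (a set has no guaranteed order):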

def f(x):
    return [', '.join(z for z in x if z != y) for y in x]
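
To see what the helper does to a single group, it can be called on a plain list of ID strings. This is only a sketch: the names `others_set` and `others_ordered` are made up here so that both variants of `f` can sit side by side.

```python
def others_set(x):
    # Set difference: all IDs in the group except the current one.
    # The order of the joined IDs is not guaranteed.
    return [', '.join(set(x) - {y}) for y in x]


def others_ordered(x):
    # Generator filter: same result, but IDs keep their original order.
    return [', '.join(z for z in x if z != y) for y in x]


group = ['1', '3', '5']          # IDs that share the name 'John Danny'
ordered = others_ordered(group)  # one joined string per row of the group
unordered = others_set(group)    # same IDs per row, order may differ
single = others_ordered(['2'])   # a group of one has no other IDs

print(ordered)
print(single)
```

Each call returns one string per input element, which is what `transform` needs in order to align the result with the original rows. A group with a single member yields an empty string, which is why non-duplicated rows end up blank.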

df['duplicated of name ids'] = df['ID'].astype(str).groupby(df['Name']).transform(f)
df['duplicated of phone ids'] = df['ID'].astype(str).groupby(df['phone']).transform(f)


df['Duplicate_name'] = df['Name'].duplicated(keep=False).map({True:'Yes', False:'No'})
df['Duplicate_phone'] = df['phone'].duplicated(keep=False).map({True:'Yes', False:'No'})
print(df)
   ID        Name phone duplicated of name ids duplicated of phone ids  \
0   1  John Danny  1111                   5, 3                       4   
1   2   Micheal K  2222                                              5   
2   3  John Danny  2233                   5, 1                           
3   4      jerred  1111                                              1   
4   5  John Danny  2222                   1, 3                       2   
5   6         joe  6666                                                  

  Duplicate_name Duplicate_phone  
0            Yes             Yes  
1             No             Yes  
2            Yes              No  
3             No             Yes  
4            Yes             Yes  
5             No              No  
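
For reference, the pieces can be combined into one self-contained script. This is a sketch assuming the sample data from the question and the order-preserving variant of the helper, renamed `other_ids` here for illustration, so the joined IDs come out in row order rather than in the set order shown in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Name': ['John Danny', 'Micheal K', 'John Danny', 'jerred', 'John Danny', 'joe'],
    'phone': ['1111', '2222', '2233', '1111', '2222', '6666'],
})


def other_ids(x):
    # For each ID in the group, join all the other IDs of that group.
    return [', '.join(z for z in x if z != y) for y in x]


# IDs must be strings so they can be joined.
ids = df['ID'].astype(str)
df['duplicated of name ids'] = ids.groupby(df['Name']).transform(other_ids)
df['duplicated of phone ids'] = ids.groupby(df['phone']).transform(other_ids)

# Yes/No flags: keep=False marks every member of a duplicate group.
yes_no = {True: 'Yes', False: 'No'}
df['Duplicate_name'] = df['Name'].duplicated(keep=False).map(yes_no)
df['Duplicate_phone'] = df['phone'].duplicated(keep=False).map(yes_no)

print(df)
```

Rows without a duplicate get an empty string in the ID columns and 'No' in the flag columns, matching the blank cells in the desired table.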
