Add a column indicating whether a record occurs in both datasets

Problem description

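I have two DataFrames that I want to concatenate and then deduplicate. Before dropping the duplicates, I want to add a column that marks whether a record occurs in both DataFrames, that is, whether its copy from df_b will be dropped by the deduplication. If the record does not also occur in df_b (it is not a duplicate across the DataFrames), the column should stay blank.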

Input data and desired result

df_a

    title             director
0   Toy Story         John Lasseter
1   Goodfellas        Martin Scorsese
2   Meet the Fockers  Jay Roach
3   The Departed      Martin Scorsese

df_b

    title             director
0   Toy Story         John Lass
1   The Hangover      Todd Phillips
2   Rocky             John Avildsen
3   The Departed      Martin Scorsese


df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
df_combine

    title             director          occurence_both
0   Toy Story         John Lasseter     b
1   Goodfellas        Martin Scorsese
2   Meet the Fockers  Jay Roach
3   The Departed      Martin Scorsese   b
5   The Hangover      Todd Phillips
6   Rocky             John Avildsen

Tags: pandas, duplicates

Solution


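We can use duplicated with keep=False to mark every row whose title appears more than once, and np.where to convert the resulting boolean Series to 'b' and ''. Then follow up with drop_duplicates to remove the duplicate rows, which by default keeps the first occurrence (here, the row from df_a). Both operations should use only the title column as the subset: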

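import numpy as np
import pandas as pd

df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
# Mark every row whose title occurs more than once (keep=False flags all copies)
df_combine['occurence_both'] = np.where(
    df_combine.duplicated(subset='title', keep=False), 'b', ''
)
# Drop the later copies, keeping the first occurrence of each title
df_combine = df_combine.drop_duplicates(subset='title')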

df_combine

              title         director occurence_both
0         Toy Story    John Lasseter              b
1        Goodfellas  Martin Scorsese               
2  Meet the Fockers        Jay Roach               
3      The Departed  Martin Scorsese              b
5      The Hangover    Todd Phillips               
6             Rocky    John Avildsen               
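
One caveat worth noting: duplicated(subset='title', keep=False) flags any repeated title, including one that is repeated inside a single DataFrame. For the sample data this makes no difference, but if only cross-DataFrame duplicates should be marked, a membership test with isin is one possible variant. The following is a minimal, self-contained sketch under that assumption. The helper name combine_and_flag and its key parameter are hypothetical and not part of the original answer; the sample frames are rebuilt from the tables in the question.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the tables in the question.
df_a = pd.DataFrame({
    "title": ["Toy Story", "Goodfellas", "Meet the Fockers", "The Departed"],
    "director": ["John Lasseter", "Martin Scorsese", "Jay Roach", "Martin Scorsese"],
})
df_b = pd.DataFrame({
    "title": ["Toy Story", "The Hangover", "Rocky", "The Departed"],
    "director": ["John Lass", "Todd Phillips", "John Avildsen", "Martin Scorsese"],
})


def combine_and_flag(df_a, df_b, key="title"):
    """Hypothetical helper: concatenate, flag keys present in BOTH frames, dedupe."""
    combined = pd.concat([df_a, df_b], ignore_index=True, sort=False)
    # A key counts as "in both" only if it appears in df_a and in df_b,
    # so repeats confined to a single frame are not flagged.
    in_both = combined[key].isin(df_a[key]) & combined[key].isin(df_b[key])
    combined["occurence_both"] = np.where(in_both, "b", "")
    # keep='first' is the default, so the df_a row survives for shared keys.
    return combined.drop_duplicates(subset=key)


result = combine_and_flag(df_a, df_b)
print(result)
```

On the sample data this should flag the same two titles (Toy Story and The Departed) as the answer above, since neither DataFrame contains a repeated title on its own.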
