pandas - Add a column indicating whether a record occurs across datasets

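Problem description

I have two DataFrames that I want to concatenate and then deduplicate. Before dropping the duplicates, I want to add a column that marks whether a record occurs in both DataFrames, that is, whether a row from df_b will be dropped by the deduplication. Otherwise the column should stay blank, indicating that the record does not occur in df_b (it is not a duplicate across the two DataFrames).

Desired result: df_combined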

df_a

    title             director
0   Toy Story         John Lasseter
1   Goodfellas        Martin Scorsese
2   Meet the Fockers  Jay Roach
3   The Departed      Martin Scorsese

df_b

    title             director
0   Toy Story         John Lass
1   The Hangover      Todd Phillips
2   Rocky             John Avildsen
3   The Departed      Martin Scorsese


df_combined = pd.concat([df_a, df_b], ignore_index=True, sort=False)
df_combined

    title             director          occurence_both
0   Toy Story         John Lasseter     b
1   Goodfellas        Martin Scorsese
2   Meet the Fockers  Jay Roach
3   The Departed      Martin Scorsese   b
5   The Hangover      Todd Phillips
6   Rocky             John Avildsen

Tags: pandas, duplicates

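We can use duplicated with keep=False to mark every row that has a duplicate, and np.where to convert the resulting boolean Series into 'b' and ''. Then follow up with drop_duplicates to remove the duplicate rows. Both operations should use only the title column as the subset, because the director values can differ between the two DataFrames (for example "John Lasseter" vs. "John Lass"):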

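import numpy as np
import pandas as pd

df_combine = pd.concat([df_a, df_b], ignore_index=True, sort=False)
# Mark every row whose title appears more than once (keep=False flags all copies)
df_combine['occurence_both'] = np.where(
    df_combine.duplicated(subset='title', keep=False), 'b', ''
)
# Drop duplicates by title, keeping the first occurrence (the row from df_a)
df_combine = df_combine.drop_duplicates(subset='title')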

df_combine:

              title         director occurence_both
0         Toy Story    John Lasseter              b
1        Goodfellas  Martin Scorsese               
2  Meet the Fockers        Jay Roach               
3      The Departed  Martin Scorsese              b
5      The Hangover    Todd Phillips               
6             Rocky    John Avildsen
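
One caveat worth noting: duplicated(subset='title', keep=False) also flags titles that repeat within a single DataFrame, not only titles shared by df_a and df_b. If that distinction matters, a possible variant is to test membership in both frames with isin instead. The following is only a sketch under that assumption, not the method used above; the helper name combine_and_flag and its key parameter are hypothetical, and the sample frames are rebuilt from the question so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
df_a = pd.DataFrame({
    "title": ["Toy Story", "Goodfellas", "Meet the Fockers", "The Departed"],
    "director": ["John Lasseter", "Martin Scorsese", "Jay Roach", "Martin Scorsese"],
})
df_b = pd.DataFrame({
    "title": ["Toy Story", "The Hangover", "Rocky", "The Departed"],
    "director": ["John Lass", "Todd Phillips", "John Avildsen", "Martin Scorsese"],
})


def combine_and_flag(a, b, key="title"):
    """Hypothetical helper: concatenate, flag keys present in BOTH frames, dedupe."""
    combined = pd.concat([a, b], ignore_index=True, sort=False)
    # A row is flagged only if its key occurs in a AND in b,
    # so repeats inside a single frame are not marked.
    in_both = combined[key].isin(a[key]) & combined[key].isin(b[key])
    combined["occurence_both"] = np.where(in_both, "b", "")
    # Keep the first occurrence of each key (rows from a come first).
    return combined.drop_duplicates(subset=key)


result = combine_and_flag(df_a, df_b)
print(result)
```

On the sample data this yields the same rows and flags as the output above; the two approaches only diverge when a title is repeated inside one of the input frames.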
