python - Python, Pandas: Compare DataFrames and save unchanged, updated, and new rows separately

Problem description

Good evening,

Suppose I have two DataFrames:

DataFrame 1:

id    |    first_name    |    last_name    |    age    |    personnel_number
1     |    Jane          |    Doe          |    37     |    0045ac
2     |    John          |    Doe          |    35     |    0102ha
3     |    Sarah         |    Smith        |    28     |    1003px
17    |    Michael       |    Mueller      |    61     |    0800pw

DataFrame 2:

id    |    first_name    |    last_name    |    age    |    personnel_number
1     |    Jane          |    Doe          |    37     |    0045ac
2     |    John          |    Doe          |    35     |    0102ha
3     |    Sarah         |    Smith        |    41     |    1003px
4     |    Sam           |    Smith        |    24     |    0017ix
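
For anyone who wants to reproduce this, the two tables above can be built roughly as follows. This is only a minimal sketch: the names df1 and df2 match the code used in the rest of the post, and the column dtypes (integers for id and age, strings for everything else) are an assumption read off the tables.

```python
import pandas as pd

columns = ["id", "first_name", "last_name", "age", "personnel_number"]

# DataFrame 1: the existing data
df1 = pd.DataFrame(
    [
        [1, "Jane", "Doe", 37, "0045ac"],
        [2, "John", "Doe", 35, "0102ha"],
        [3, "Sarah", "Smith", 28, "1003px"],
        [17, "Michael", "Mueller", 61, "0800pw"],
    ],
    columns=columns,
)

# DataFrame 2: the incoming data (Sarah's age changed, Sam is new)
df2 = pd.DataFrame(
    [
        [1, "Jane", "Doe", 37, "0045ac"],
        [2, "John", "Doe", 35, "0102ha"],
        [3, "Sarah", "Smith", 41, "1003px"],
        [4, "Sam", "Smith", 24, "0017ix"],
    ],
    columns=columns,
)
```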

I know that the following code gives me a new DataFrame in which existing rows are updated and new rows are added...

df_comp = df2.set_index('personnel_number').combine_first(df1.set_index('personnel_number')).reset_index()

...to get this:

Combined DataFrame:

id    |    first_name    |    last_name    |    age    |    personnel_number
1     |    Jane          |    Doe          |    37     |    0045ac
2     |    John          |    Doe          |    35     |    0102ha
3     |    Sarah         |    Smith        |    41     |    1003px
17    |    Michael       |    Mueller      |    61     |    0800pw
4     |    Sam           |    Smith        |    24     |    0017ix

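My question: is there a way to get three DataFrames instead of one combined DataFrame, holding the following data:

A DataFrame with the existing rows that are unchanged
A DataFrame with the rows that were updated
A DataFrame with the new rows

Note

There is always one column with unique values ("personnel_number" in this example).

Thank you for your help and suggestions, and have a nice weekend!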

Tags: python, pandas, dataframe, comparison

You can try an outer merge with an indicator, then use groupby to build a few conditions, and finally store the resulting groups in a dictionary:

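import numpy as np

# Outer merge on all shared columns; 'group' records where each row came from
out = df2.merge(df1, how='outer', indicator='group')

# True for every row whose personnel_number has more than one distinct version
c = out.groupby("personnel_number", sort=False).transform('nunique').gt(1).any(axis=1)

# The default label is an addition: it covers rows found only in df1,
# which match none of the three conditions
out['group'] = (np.select([out['group'].eq("both"), out['group'].ne("both") & c,
                           out['group'].isin(['both', 'left_only']) & ~c],
                          ['Already_exists', 'Updated', 'New'],
                          default='Only_in_df1'))

d = dict(iter(out.groupby("group")))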
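
As an alternative to the merge-based approach above, the same split can be sketched by aligning both frames on the unique key and comparing rows directly. This is a hedged sketch, not the answer's method: the variable names (old, new, unchanged, updated, added) are hypothetical, the sample frames are rebuilt from the question's tables, and it assumes both frames share the same columns and contain no missing values (NaN never compares equal, so such rows would be reported as updated).

```python
import pandas as pd

columns = ["id", "first_name", "last_name", "age", "personnel_number"]
df1 = pd.DataFrame(
    [[1, "Jane", "Doe", 37, "0045ac"], [2, "John", "Doe", 35, "0102ha"],
     [3, "Sarah", "Smith", 28, "1003px"], [17, "Michael", "Mueller", 61, "0800pw"]],
    columns=columns,
)
df2 = pd.DataFrame(
    [[1, "Jane", "Doe", 37, "0045ac"], [2, "John", "Doe", 35, "0102ha"],
     [3, "Sarah", "Smith", 41, "1003px"], [4, "Sam", "Smith", 24, "0017ix"]],
    columns=columns,
)

key = "personnel_number"
old = df1.set_index(key)
new = df2.set_index(key)

# Keys that occur in both frames
common = new.index.intersection(old.index)

# A shared row counts as unchanged when every non-key column is equal
same = (new.loc[common] == old.loc[common, new.columns]).all(axis=1)

unchanged = new.loc[same[same].index].reset_index()
updated = new.loc[same[~same].index].reset_index()  # df2's version of the row
added = new.loc[new.index.difference(old.index)].reset_index()

for name, frame in [("unchanged", unchanged), ("updated", updated), ("new", added)]:
    print(name)
    print(frame, end="\n\n")
```

Rows that exist only in df1 (Michael Mueller here) end up in none of the three frames; if they should be treated as unchanged existing data, old.loc[old.index.difference(new.index)] would select them.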

