首页 > 解决方案 > 匹配和连接两个不一致的 DataFrame

问题描述

我有两个数据框正在从两个具有共同特征但并不总是相同特征的独立数据库中查询,我需要找到一种方法将两者可靠地连接在一起。

举个例子:

import pandas as pd
inp = [{'Name':'Jose', 'Age':12,'Location':'Frankfurt','Occupation':'Student','Mothers Name':'Rosy'}, {'Name':'Katherine','Age':23,'Location':'Maui','Occupation':'Lawyer','Mothers Name':'Amy'}, {'Name':'Larry','Age':22,'Location':'Dallas','Occupation':'Nurse','Mothers Name':'Monica'}]
df = pd.DataFrame(inp)
print (df)

   Age   Location Mothers Name       Name Occupation
0   12  Frankfurt         Rosy       Jose    Student
1   23       Maui          Amy  Katherine     Lawyer
2   22     Dallas       Monica      Larry      Nurse



inp2 = [{'Name': '','Occupation':'Nurse','Favorite Hobby':'Basketball','Mothers Name':'Monica'},{'Name':'Jose','Occupation':'','Favorite Hobby':'Sewing','Mothers Name':'Rosy'},{'Name':'Katherine','Occupation':'Lawyer','Favorite Hobby':'Reading','Mothers Name':''}]
df2 = pd.DataFrame(inp2)
print(df2)

  Favorite Hobby Mothers Name       Name Occupation
0     Basketball       Monica                 Nurse
1         Sewing         Rosy       Jose           
2        Reading               Katherine     Lawyer

我需要找到一种方法来可靠地连接这两个数据帧,而数据不会始终保持一致。为了使问题进一步复杂化,两个数据库的长度并不总是相同。有任何想法吗?

标签: pythonpandas

解决方案


您可以在可能的列组合上执行合并并连接这些 df,然后在第一个(完整)df 上合并新的 df:

# do your three possible merges on 'Mothers Name', 'Name', and 'Occupation'
# then concat your dataframes
new_df = pd.concat([df.merge(df2, on=['Mothers Name', 'Name']), 
                    df.merge(df2, on=['Name', 'Occupation']),
                    df.merge(df2, on=['Mothers Name', 'Occupation'])], sort=False)

# take the first dataframe, which is complete, and merge with your new_df and drop dups
df.merge(new_df[['Age', 'Location', 'Favorite Hobby']], on=['Age', 'Location']).drop_duplicates()

    Age Location    Mothers Name    Name      Occupation    Favorite Hobby
0   12  Frankfurt      Rosy         Jose       Student        Sewing
2   23  Maui           Amy          Katherine  Lawyer         Reading
4   22  Dallas         Monica       Larry      Nurse          Basketball

这假设每行年龄和位置都是唯一的


推荐阅读