Concatenate two DataFrames and drop duplicate rows based on a column value

Problem description

I have two DataFrames.

df1

    Name Symbol         ID
0    Jay    N/A    372Y105
1    Ray    N/A    4446100
2   Faye    N/A    484MAA4
3   Maye    N/A    504W308
4    Kay    N/A    782L107
5   Trey    FFF    782L111

df2

    Name Symbol         ID
0    Jay    AAA    372Y105
1   Faye    CCC    484MAA4
2    Kay    EEE    782L107

If ID matches between df1 and df2, I want to replace the Symbol in df1 with the Symbol from df2, so that the output looks like:

    Name Symbol         ID
0    Jay    AAA    372Y105
1    Ray    N/A    4446100
2   Faye    CCC    484MAA4
3   Maye    N/A    504W308
4    Kay    EEE    782L107
5   Trey    FFF    782L111

It sounds like I should first concatenate the two DataFrames and then somehow drop the duplicates, for example:

df3 = pd.concat([df1, df2])
df3 = df3.drop_duplicates(subset='ID', keep='last')

However, I don't want to simply keep the first or the last duplicate; I only want to drop the duplicated rows where Symbol = N/A.
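
The filter described above could be sketched roughly as follows. This is only an illustration of the concat-based idea, not the accepted approach: it assumes the placeholders are either literal "N/A" strings or real NaN values, and the names is_dup_id, is_na and result are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Jay", "Ray", "Faye", "Maye", "Kay", "Trey"],
    "Symbol": ["N/A", "N/A", "N/A", "N/A", "N/A", "FFF"],
    "ID": ["372Y105", "4446100", "484MAA4", "504W308", "782L107", "782L111"],
})
df2 = pd.DataFrame({
    "Name": ["Jay", "Faye", "Kay"],
    "Symbol": ["AAA", "CCC", "EEE"],
    "ID": ["372Y105", "484MAA4", "782L107"],
})

df3 = pd.concat([df1, df2], ignore_index=True)

# True for every row whose ID appears more than once in df3.
is_dup_id = df3["ID"].duplicated(keep=False)
# Assumption: a placeholder is either the string "N/A" or a real NaN.
is_na = df3["Symbol"].isna() | df3["Symbol"].eq("N/A")

# Drop a row only if its ID is duplicated AND its Symbol is a placeholder.
result = df3[~(is_dup_id & is_na)]

# concat puts the df2 rows last; sorting by ID gives a deterministic order
# (for this sample data it happens to match the order of df1).
result = result.sort_values("ID").reset_index(drop=True)
print(result)
```

Rows whose ID appears only once are kept even when their Symbol is N/A, which is why Ray and Maye survive.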

Tags: python, pandas

Solution


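First do a left join with merge, then fill the missing values in the Symbol column from the Symbol_ column. This assumes the N/A entries in df1 are real missing values (NaN), as in the output below: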

print (df1.merge(df2, on=['Name','ID'], how='left', suffixes=('', '_')))
   Name Symbol       ID Symbol_
0   Jay    NaN  372Y105     AAA
1   Ray    NaN  4446100     NaN
2  Faye    NaN  484MAA4     CCC
3  Maye    NaN  504W308     NaN
4   Kay    NaN  782L107     EEE
5  Trey    FFF  782L111     NaN

df = (df1.merge(df2, on=['Name','ID'], how='left', suffixes=('', '_'))
         .assign(Symbol = lambda x: x['Symbol'].fillna(x.pop('Symbol_'))))
print (df)
   Name Symbol       ID
0   Jay    AAA  372Y105
1   Ray    NaN  4446100
2  Faye    CCC  484MAA4
3  Maye    NaN  504W308
4   Kay    EEE  782L107
5  Trey    FFF  782L111
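
For anyone who wants to run this end to end, here is a self-contained sketch of the same merge-then-fillna idea. It assumes the sample data stores the placeholder as the literal string "N/A", so it converts those to NaN first; if your data already holds NaN, that step changes nothing. The name merged is introduced only for this example.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Jay", "Ray", "Faye", "Maye", "Kay", "Trey"],
    "Symbol": ["N/A", "N/A", "N/A", "N/A", "N/A", "FFF"],
    "ID": ["372Y105", "4446100", "484MAA4", "504W308", "782L107", "782L111"],
})
df2 = pd.DataFrame({
    "Name": ["Jay", "Faye", "Kay"],
    "Symbol": ["AAA", "CCC", "EEE"],
    "ID": ["372Y105", "484MAA4", "782L107"],
})

# Assumption: placeholders are literal "N/A" strings. mask() turns them
# into NaN so that fillna() can treat them as missing.
df1["Symbol"] = df1["Symbol"].mask(df1["Symbol"].eq("N/A"))

# Left join keeps every row of df1; df2's Symbol arrives as "Symbol_".
merged = df1.merge(df2, on=["Name", "ID"], how="left", suffixes=("", "_"))

# Fill only the missing Symbol values, then drop the helper column.
merged["Symbol"] = merged["Symbol"].fillna(merged["Symbol_"])
df = merged.drop(columns="Symbol_")
print(df)
```

Because fillna only touches missing values, a Symbol that is already set in df1 (such as FFF for Trey) is never overwritten by df2.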

Another solution uses DataFrame.update:

df1 = df1.set_index(['Name','ID'])
df2 = df2.set_index(['Name','ID'])
df1.update(df2)
df1 = df1.reset_index()
print (df1)
   Name       ID Symbol
0   Jay  372Y105    AAA
1   Ray  4446100    NaN
2  Faye  484MAA4    CCC
3  Maye  504W308    NaN
4   Kay  782L107    EEE
5  Trey  782L111    FFF
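
A runnable sketch of the update approach, under the assumption that the placeholders are literal "N/A" strings (left and right are names chosen for the example). It works on indexed copies, so df1 and df2 themselves stay unchanged.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Jay", "Ray", "Faye", "Maye", "Kay", "Trey"],
    "Symbol": ["N/A", "N/A", "N/A", "N/A", "N/A", "FFF"],
    "ID": ["372Y105", "4446100", "484MAA4", "504W308", "782L107", "782L111"],
})
df2 = pd.DataFrame({
    "Name": ["Jay", "Faye", "Kay"],
    "Symbol": ["AAA", "CCC", "EEE"],
    "ID": ["372Y105", "484MAA4", "782L107"],
})

# set_index returns new objects, so the originals are not modified.
left = df1.set_index(["Name", "ID"])
right = df2.set_index(["Name", "ID"])

# update() works in place and returns None. It aligns on the index and
# overwrites values in `left` with the non-NA values from `right`.
left.update(right)

result = left.reset_index()
print(result)
```

One difference from the fillna approach is worth keeping in mind: update overwrites every matching value in df1, whether or not it was missing, while fillna only fills the gaps. For the sample data both give the same Symbol column.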
