首页 > 解决方案 > 多次匹配后合并 Pandas 数据框

问题描述

我有 2 个数据框,我在其中找到基于列 ( tld) 的常见匹配项,如果找到匹配项(在 和 中的列之间source),我将列 ( ) 的值从源destination复制到数据框。我还检查不同的列是否匹配。( ) 然后提取.uuiddestinationcompany_nameuuid

现在我需要比较不同的列 ( similar_companies) 并提取uuid

数据框 1:来源

   uuid           website company_name           tld
0     1a  www.facebook.com     facebook  facebook.com
1     2b     www.yahoo.com    yahoo inc     yahoo.com
2     3c    www.google.com       Google    google.com
3     4d     www.cisco.com        Cisco     cisco.com

数据框 2:目的地

  id           website  company_name           tld  match uuid
0  a  www.facebook.com      facebook  facebook.com  False  NaN
1  b         www.y.com     Yahoo Inc         y.com  False  NaN
2  c         www.g.com        Google         g.com  False  NaN
3  d         www.g.com    Google Inc         g.com  False  NaN
4  e  www.facebook.com  Facebook Inc  facebook.com  False  NaN

期望的输出:

id           website  company_name           tld  match similar_companies
0  a  www.facebook.com      facebook  facebook.com   True          Facebook   
1  b         www.y.com     Yahoo Inc         y.com  False              None   
2  c         www.g.com        Google         g.com   True              None   
3  d         www.g.com    Google Inc         g.com  False              None   
4  e  www.facebook.com      Facebook  facebook.com   True          facebook   
5  f       www.face.uk  Facebook Inc       face.uk   True          facebook   

  uuid  
0   1a  
1  NaN  
2   3c  
3  NaN  
4   1a  
5   1a

当前代码:

# Find if TLD is the same.
match_tld = destination.tld.isin(source.tld)
# Find if Company name is the same.
match_company_name = destination.company_name.isin(
      source.company_name)
# Find similar source.
destination[
      _SIMILAR_COMPANIES] = destination.company_name.apply(
          _FindSimilarCompanies, args=(destination,))
# Find if Company name is the same from similar source.
match_similar_companies = destination.similar_companies.isin(
      source.company_name)
# Update match column if TLD or company_name matches.
destination['match'] = match_tld | match_company_name | match_similar_companies
# Extract UUID for TLD matches.
merge_tld = destination.merge(
      source[['tld', 'uuid']], on='tld', how='left')
# Extract UUID for company name matches.
destination = destination.merge(
      source[['company_name', 'uuid']], on='company_name', how='left')
# I insert new line here!!!
# Combine dataframes.
destination['uuid'] = destination['uuid'].combine_first(merge_tld['uuid'])
logging.info(source)
logging.info(destination)

上面的代码适用于 2 列,但是当我尝试合并新列时,我得到一个 KeyError:(我在插入新代码的地方添加了一条评论)

destination = destination.merge(
      source[['company_name', 'uuid']], on='similar_companies', how='left')

错误:

KeyError: 'similar_companies'

标签: pythonpandasdataframe

解决方案


我认为问题是没有列similar_companiessource所以是必要的rename

#for sample data column
_SIMILAR_COMPANIES = 'similar_companies'
destination[_SIMILAR_COMPANIES] = destination.company_name.str.extract('([fF]acebook)')

destination1 = destination.merge(
      source[['company_name', 'uuid']], on='company_name', how='left')

destination2 = (destination.merge(
       source[['company_name', 'uuid']].rename(columns={'company_name':'similar_companies'}),
       on='similar_companies', how='left'))
# Combine dataframes.
merge_tld['uuid'] = (merge_tld['uuid'].combine_first(destination1['uuid'])
                                      .combine_first(destination2['uuid']))
print (merge_tld)
  id           website  company_name           tld  match similar_companies  \
0  a  www.facebook.com      facebook  facebook.com   True          facebook   
1  b         www.y.com     Yahoo Inc         y.com  False               NaN   
2  c         www.g.com        Google         g.com   True               NaN   
3  d         www.g.com    Google Inc         g.com  False               NaN   
4  e  www.facebook.com  Facebook Inc  facebook.com   True          Facebook   

  uuid  
0   1a  
1  NaN  
2   3c  
3  NaN  
4   1a  

推荐阅读