Pandas: do not merge on empty columns

Problem description

I have 2 files:

File 1:

website,company_name,uuid
yahoo.com,Yahoo,1111
facebook.com,Facebook,2222
cisco.com,Cisco Inc,3333
,Oasis,4444
google.com,Google Inc,5555

File 2:

company_url,company_name
facebook.com,
google.com,Google
,Netflix
,Pandas

My script needs to match on either the website or the company name and produce the following output:

company_url,company_name,uuid
facebook.com,,2222
google.com,google,5555
,netflix,
,pandas,

Currently it produces the following:

company_url,company_name,uuid
facebook.com,,2222.0
google.com,google,5555.0
,netflix,4444
,pandas,4444

It seems to be matching against the empty website in File 1.

Code:

  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')      
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])

Function:

def MatchCompetitorCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> pandas.DataFrame:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    A sequence of matched companies and their UUID.

  Raises:
    ValueError: No companies found.
  """

  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields.
  companies = companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  competitor_companies = competitor_companies.fillna('')
  # Rename column to TLD to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Rename column to TLD since Competitor already contains TLD in company_url.
  competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)      
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')      
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])
  match_companies = len(
      competitor_companies[competitor_companies['uuid'].notnull()])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
  return competitor_companies

I used this trick:

companies = companies.fillna(' ')

But I would like to know whether there is a better solution.

Tags: python, pandas

Solution
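
The snippets below use df1 and df2 without showing how they are created. A minimal sketch, assuming they are simply the two files from the question read with `pandas.read_csv` (the constant names `FILE1` and `FILE2` are made up for illustration):

```python
import io

import pandas as pd

# Contents of File 1 and File 2, copied from the question.
FILE1 = """website,company_name,uuid
yahoo.com,Yahoo,1111
facebook.com,Facebook,2222
cisco.com,Cisco Inc,3333
,Oasis,4444
google.com,Google Inc,5555
"""

FILE2 = """company_url,company_name
facebook.com,
google.com,Google
,Netflix
,Pandas
"""

# read_csv turns empty fields into NaN by default, so the blank
# website and company_name cells become missing values, not ''.
df1 = pd.read_csv(io.StringIO(FILE1))
df2 = pd.read_csv(io.StringIO(FILE2))
```

Keeping the blanks as NaN, rather than filling them with '' as the question does, is what lets the steps below filter them out with isnull().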


First, merge df1 and df2 on the keys 'website' / 'company_url' (purpose: extract the complete 'company_url' information and the partial 'uuid' information).

df3 = (df1
       .merge(df2, left_on='website', right_on='company_url', how='right')
       .drop(['website', 'company_name_x'], axis=1)
       )
df3 = (df3
       .loc[~(df3.company_url.isnull())]
       .reset_index(drop=True)
       .rename(columns={'company_name_y': 'company_name'})
      )

df3

   uuid   company_url company_name
0  2222  facebook.com          NaN
1  5555    google.com       Google
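
The null filter on company_url matters because pandas matches null keys against each other during a merge, just as the question's fillna('') made every empty string match every other empty string. A small sketch of that behaviour, using made-up frames `left` and `right` rather than the question's data:

```python
import numpy as np
import pandas as pd

# Toy lookup table: one real key and one missing key.
left = pd.DataFrame({'key': ['a.com', np.nan], 'uuid': [1, 2]})
# Toy frame to enrich: one real key and two missing keys.
right = pd.DataFrame({'key': ['a.com', np.nan, np.nan]})

# Null keys on both sides are matched with each other, so both
# null rows of `right` pick up uuid 2.
naive = right.merge(left, on='key', how='left')

# Dropping null keys from the lookup side leaves those rows unmatched.
safe = right.merge(left.dropna(subset=['key']), on='key', how='left')
```

In `naive` every row gets a uuid, while in `safe` only the 'a.com' row does.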

Second, merge df1 and df2 on the key 'company_name' (purpose: extract the remaining (partial) 'uuid' information, which is not needed in this toy example).

df4 = (df1
       .merge(df2, on='company_name', how='right')
       .drop(['website', 'company_url'], axis=1)
       )
df4 = (df4
       .loc[~df4.company_name.isnull()]
       .reset_index(drop=True)
      )

df4

  company_name  uuid
0       Google   NaN
1      Netflix   NaN
2       Pandas   NaN

Finally, merge the rows of df3 where 'company_name' is not null with df4, then append to the merged dataframe the rows of df3 where 'company_name' is null.

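# Assumes pandas has been imported as pd.
df5 = (df3
       .loc[~df3.company_name.isnull()]
       .merge(df4, on='company_name', how='right')
       )
# Prefer the uuid found via the URL match, fall back to the name match.
df5['uuid'] = df5['uuid_x'].fillna(df5['uuid_y'])
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
df5 = (pd.concat(
           [df5.drop(['uuid_x', 'uuid_y'], axis=1),
            df3.loc[df3.company_name.isnull()]],
           ignore_index=True, sort=False)
       .fillna("")
       )
df5['company_name'] = df5.company_name.str.lower()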

df5

  company_name   company_url  uuid
0       google    google.com  5555
1      netflix                    
2       pandas                    
3               facebook.com  2222 
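
For comparison, the same result can probably be reached with fewer intermediate frames by building a lookup only from rows whose key is present and mapping it onto the competitor frame. This is a sketch of a different technique, not the approach above; the helper name `lookup` is hypothetical, and lowercasing is only a stand-in for the question's `_NormalizeText`, whose behaviour is not shown:

```python
import pandas as pd

companies = pd.DataFrame({
    'website': ['yahoo.com', 'facebook.com', 'cisco.com', None, 'google.com'],
    'company_name': ['Yahoo', 'Facebook', 'Cisco Inc', 'Oasis', 'Google Inc'],
    'uuid': [1111, 2222, 3333, 4444, 5555],
})
competitors = pd.DataFrame({
    'company_url': ['facebook.com', 'google.com', None, None],
    'company_name': [None, 'Google', 'Netflix', 'Pandas'],
})


def lookup(keys: pd.Series, table: pd.DataFrame, key_col: str) -> pd.Series:
    """Map keys to uuid, ignoring rows of `table` whose key is missing."""
    mapping = (table.dropna(subset=[key_col])
               .drop_duplicates(subset=[key_col])
               .set_index(key_col)['uuid'])
    # Keys that are missing or absent from the mapping come back as NaN.
    return keys.map(mapping)


# Assumption: lowercasing approximates the question's _NormalizeText.
companies['company_name'] = companies['company_name'].str.lower()
competitors['company_name'] = competitors['company_name'].str.lower()

by_url = lookup(competitors['company_url'], companies, 'website')
by_name = lookup(competitors['company_name'], companies, 'company_name')

# Same precedence as the question: name match first, then URL match.
competitors['uuid'] = by_name.combine_first(by_url)
```

With the question's data this leaves uuid set for facebook.com and google.com and missing for Netflix and Pandas, and the original row order of File 2 is preserved.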

Hope this helps.

