python - Pandas: don't merge on empty columns
Question
I have two files:
File 1:
website,company_name,uuid
yahoo.com,Yahoo,1111
facebook.com,Facebook,2222
cisco.com,Cisco Inc,3333
,Oasis,4444
google.com,Google Inc,5555
File 2:
company_url,company_name
facebook.com,
google.com,Google
,Netflix
,Pandas
My script needs to match on either the website or the company name and produce the following output:
company_url,company_name,uuid
facebook.com,,2222
google.com,google,5555
,netflix,
,pandas,
Currently it produces this instead:
company_url,company_name,uuid
facebook.com,,2222.0
google.com,google,5555.0
,netflix,4444
,pandas,4444
It seems that the rows with an empty URL are being matched against the row with the empty website in File 1.
Code:
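A minimal reproduction of this behaviour, as a sketch: the two frames below are cut-down stand-ins for the files, and the names `companies` and `competitors` are illustrative. After `fillna('')`, the blank website and the blank URLs all become the same key, `''`, so a left merge pairs them like any other equal strings. pandas also matches NaN keys with each other in `merge`, so leaving out `fillna` would not avoid the match.

```python
import pandas as pd

# Cut-down stand-ins for File 1 and File 2 (illustrative data only).
companies = pd.DataFrame({
    "tld": ["facebook.com", None, "google.com"],
    "uuid": [2222, 4444, 5555],
})
competitors = pd.DataFrame({
    "tld": ["facebook.com", "google.com", None, None],
    "company_name": [None, "google", "netflix", "pandas"],
})

# With blanks filled as '', the blank keys compare equal and get joined.
filled = competitors.fillna("").merge(companies.fillna(""), on="tld", how="left")
print(filled)

# Without fillna, the missing keys are still matched against each other.
raw = competitors.merge(companies, on="tld", how="left")
print(raw)
```

In both results the `netflix` and `pandas` rows pick up the uuid of the company whose website is blank.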
merge_tld = competitor_companies.merge(
    companies[['tld', 'uuid']], on='tld', how='left')
# Extracts UUID for company name matches.
competitor_companies = competitor_companies.merge(
    companies[['company_name', 'uuid']], on='company_name', how='left')
# Combines dataframes.
competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
    merge_tld['uuid'])
Function:
def MatchCompetitorCompanies(
    companies: pandas.DataFrame,
    competitor_companies: pandas.DataFrame) -> pandas.DataFrame:
  """Find Competitor companies in companies dataframe and generate a new list.

  Args:
    companies: A dataframe with company information from CSV file.
    competitor_companies: A dataframe with Competitor information from CSV file.

  Returns:
    The competitor dataframe with a uuid column for matched companies.

  Raises:
    ValueError: No companies found.
  """
  if _IsEmpty(companies):
    raise ValueError('No companies found')
  # Clean up empty fields.
  companies = companies.fillna('')
  logging.info('Found: %d records.', len(competitor_companies))
  competitor_companies = competitor_companies.fillna('')
  # Rename column to TLD to compare matching companies.
  companies.rename(columns={'website': 'tld'}, inplace=True)
  logging.info('Cleaning up company name.')
  companies.company_name = companies.company_name.apply(_NormalizeText)
  competitor_companies.company_name = competitor_companies.company_name.apply(
      _NormalizeText)
  # Rename column to TLD since Competitor already contains TLD in company_url.
  competitor_companies.rename(columns={'company_url': 'tld'}, inplace=True)
  logging.info('Extracting UUID')
  merge_tld = competitor_companies.merge(
      companies[['tld', 'uuid']], on='tld', how='left')
  # Extracts UUID for company name matches.
  competitor_companies = competitor_companies.merge(
      companies[['company_name', 'uuid']], on='company_name', how='left')
  # Combines dataframes.
  competitor_companies['uuid'] = competitor_companies['uuid'].combine_first(
      merge_tld['uuid'])
  match_companies = len(
      competitor_companies[competitor_companies['uuid'].notnull()])
  total_companies = len(competitor_companies)
  logging.info('Results found: %d out of %d', match_companies, total_companies)
  competitor_companies.rename(columns={'tld': 'company_url'}, inplace=True)
  return competitor_companies
I worked around it with this trick:
companies = companies.fillna(' ')
But I would like to know whether there is a better solution.
Solution
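The snippets in this answer use `df1` and `df2` without showing how they were built. A sketch of one way to get them, assuming they are File 1 and File 2 from the question read with `pandas.read_csv` and default options (so blank fields become NaN); the constants `FILE_1` and `FILE_2` are illustrative names:

```python
import io

import pandas as pd

# The two files from the question, inlined as text for a self-contained example.
FILE_1 = """website,company_name,uuid
yahoo.com,Yahoo,1111
facebook.com,Facebook,2222
cisco.com,Cisco Inc,3333
,Oasis,4444
google.com,Google Inc,5555
"""
FILE_2 = """company_url,company_name
facebook.com,
google.com,Google
,Netflix
,Pandas
"""

df1 = pd.read_csv(io.StringIO(FILE_1))
df2 = pd.read_csv(io.StringIO(FILE_2))

# Blank CSV fields are read as NaN, not as empty strings.
print(df1.isnull().sum())
print(df2.isnull().sum())
```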
First, merge df1 and df2 on the keys 'website' / 'company_url' (purpose: extract the complete 'company_url' information and the partial 'uuid' information):
df3 = (df1
       .merge(df2, left_on='website', right_on='company_url', how='right')
       .drop(['website', 'company_name_x'], axis=1)
      )
df3 = (df3
       .loc[~(df3.company_url.isnull())]
       .reset_index(drop=True)
       .rename(columns={'company_name_y': 'company_name'})
      )
df3:
   uuid   company_url company_name
0  2222  facebook.com          NaN
1  5555    google.com       Google
Second, merge df1 and df2 on the key 'company_name' (purpose: extract further, partial 'uuid' information, which this toy example happens not to need):
df4 = (df1
       .merge(df2, on='company_name', how='right')
       .drop(['website', 'company_url'], axis=1)
      )
df4 = (df4
       .loc[~df4.company_name.isnull()]
       .reset_index(drop=True)
      )
df4:
  company_name  uuid
0       Google   NaN
1      Netflix   NaN
2       Pandas   NaN
Finally, merge the rows of df3 where 'company_name' is not null with df4, then append the rows of df3 where 'company_name' is null to the merged dataframe:
df5 = (df3
       .loc[~df3.company_name.isnull()]
       .merge(df4, on='company_name', how='right')
      )
df5['uuid'] = df5['uuid_x'].fillna(df5['uuid_y'])
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
# here (assumes `import pandas as pd`).
df5 = (pd.concat(
           [df5.drop(['uuid_x', 'uuid_y'], axis=1),
            df3.loc[df3.company_name.isnull()]],
           ignore_index=True, sort=False)
       .fillna("")
      )
df5['company_name'] = df5.company_name.str.lower()
df5:
  company_name   company_url  uuid
0       google    google.com  5555
1      netflix
2       pandas
3               facebook.com  2222
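As an alternative to the staged merges above and to the `fillna(' ')` trick in the question, here is a sketch that removes blank keys from the lookup side before each merge, so a blank URL or name has nothing to match against. This is not the approach of the answer itself. The helper `_lookup` and the variable names are hypothetical, and the sketch assumes each website and company name occurs at most once in File 1.

```python
import io

import pandas as pd

companies = pd.read_csv(io.StringIO(
    "website,company_name,uuid\n"
    "yahoo.com,Yahoo,1111\n"
    "facebook.com,Facebook,2222\n"
    "cisco.com,Cisco Inc,3333\n"
    ",Oasis,4444\n"
    "google.com,Google Inc,5555\n"))
competitors = pd.read_csv(io.StringIO(
    "company_url,company_name\n"
    "facebook.com,\n"
    "google.com,Google\n"
    ",Netflix\n"
    ",Pandas\n"))


def _lookup(frame, key):
    """Return key -> uuid rows, without blank keys (hypothetical helper)."""
    usable = frame[key].notna() & (frame[key] != "")
    # drop_duplicates keeps the left merges below at one row per competitor.
    return frame.loc[usable, [key, "uuid"]].drop_duplicates(key)


by_url = competitors.merge(
    _lookup(companies.rename(columns={"website": "company_url"}), "company_url"),
    on="company_url", how="left")
by_name = competitors.merge(
    _lookup(companies, "company_name"), on="company_name", how="left")

result = competitors.copy()
# Prefer the URL match and fall back to the name match.
result["uuid"] = by_url["uuid"].combine_first(by_name["uuid"]).astype("Int64")
print(result)
```

The nullable `Int64` dtype keeps matched values as integers (2222 rather than 2222.0) while unmatched rows stay missing. Names are compared as-is here; the normalisation step from the question would still be needed for 'Google' to match 'Google Inc'.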
Hope this helps.