python - 多次匹配后合并 Pandas 数据框
问题描述
我有 2 个数据框,我在其中找到基于列 ( tld
) 的常见匹配项,如果找到匹配项(在 和 中的列之间source
),我将列 ( ) 的值从源destination
复制到数据框。我还检查不同的列是否匹配。( ) 然后提取.uuid
destination
company_name
uuid
现在我需要比较不同的列 ( similar_companies
) 并提取uuid
。
数据框 1:来源
uuid website company_name tld
0 1a www.facebook.com facebook facebook.com
1 2b www.yahoo.com yahoo inc yahoo.com
2 3c www.google.com Google google.com
3 4d www.cisco.com Cisco cisco.com
数据框 2:目的地
id website company_name tld match uuid
0 a www.facebook.com facebook facebook.com False NaN
1 b www.y.com Yahoo Inc y.com False NaN
2 c www.g.com Google g.com False NaN
3 d www.g.com Google Inc g.com False NaN
4 e www.facebook.com Facebook Inc facebook.com False NaN
期望的输出:
id website company_name tld match similar_companies
0 a www.facebook.com facebook facebook.com True Facebook
1 b www.y.com Yahoo Inc y.com False None
2 c www.g.com Google g.com True None
3 d www.g.com Google Inc g.com False None
4 e www.facebook.com Facebook facebook.com True facebook
5 f www.face.uk Facebook Inc face.uk True facebook
uuid
0 1a
1 NaN
2 3c
3 NaN
4 1a
5 1a
当前代码:
# Find if TLD is the same.
match_tld = destination.tld.isin(source.tld)
# Find if Company name is the same.
match_company_name = destination.company_name.isin(
source.company_name)
# Find similar source.
destination[
_SIMILAR_COMPANIES] = destination.company_name.apply(
_FindSimilarCompanies, args=(destination,))
# Find if Company name is the same from similar source.
match_similar_companies = destination.similar_companies.isin(
source.company_name)
# Update match column if TLD or company_name matches.
destination['match'] = match_tld | match_company_name | match_similar_companies
# Extract UUID for TLD matches.
merge_tld = destination.merge(
source[['tld', 'uuid']], on='tld', how='left')
# Extract UUID for company name matches.
destination = destination.merge(
source[['company_name', 'uuid']], on='company_name', how='left')
# I insert new line here!!!
# Combine dataframes.
destination['uuid'] = destination['uuid'].combine_first(merge_tld['uuid'])
logging.info(source)
logging.info(destination)
上面的代码适用于 2 列,但是当我尝试合并新列时,我得到一个 KeyError:(我在插入新代码的地方添加了一条评论)
destination = destination.merge(
source[['company_name', 'uuid']], on='similar_companies', how='left')
错误:
KeyError: 'similar_companies'
解决方案
我认为问题是没有列similar_companies
,source
所以是必要的rename
:
#for sample data column
_SIMILAR_COMPANIES = 'similar_companies'
destination[_SIMILAR_COMPANIES] = destination.company_name.str.extract('([fF]acebook)')
destination1 = destination.merge(
source[['company_name', 'uuid']], on='company_name', how='left')
destination2 = (destination.merge(
source[['company_name', 'uuid']].rename(columns={'company_name':'similar_companies'}),
on='similar_companies', how='left'))
# Combine dataframes.
merge_tld['uuid'] = (merge_tld['uuid'].combine_first(destination1['uuid'])
.combine_first(destination2['uuid']))
print (merge_tld)
id website company_name tld match similar_companies \
0 a www.facebook.com facebook facebook.com True facebook
1 b www.y.com Yahoo Inc y.com False NaN
2 c www.g.com Google g.com True NaN
3 d www.g.com Google Inc g.com False NaN
4 e www.facebook.com Facebook Inc facebook.com True Facebook
uuid
0 1a
1 NaN
2 3c
3 NaN
4 1a
推荐阅读
- c# - 根据泛型参数类型实例化一个类
- android - Android Studio 3.2 布局“预览”没有正确模拟材料设计组件
- python - 你能调用这样的函数吗:func(x)(y)?
- hangout - 无法使用替代文本格式化链接
- javascript - 让当前页面访问者在 JS 中准备好文档
- docker - su:删除读取后权限被拒绝,执行其他位
- sql - SQL - 自加入表时避免重复
- sql-server - PySpark SQL 中的日期之间的差异
- javascript - 如何使用 PHP 处理多个表单输入
- matlab - 循环我的算法以在同一张图上绘制不同的参数值(MATLAB)