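python - Pandas: map two dataframes using regular expressions
Problem description
I have two dataframes: one with text and another with regular expressions and pattern names. I need to use the regexes to map the Pattern column from the second dataframe onto the first.
Edit: what I need is to apply each regex to every df['text'] row and, if it matches, add the Pattern to a new column.
Sample data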
import pandas as pd

text_dict = {'text': ['customer and increased repair and remodel activity as well as from other sales',
                      'sales for the overseas customers',
                      'marketing approach is driving strong play from top tier customers',
                      'employees in India have been the continuance of remote work will impact productivity',
                      'sales due to higher customer']}
regex_dict = {'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
              'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                        '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                        '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
# Build the two dataframes shown below (names taken from the code later in the post)
df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)
df
text
0 customer and increased repair and remodel acti...
1 sales for the overseas customers
2 marketing approach is driving strong play from...
3 employees in India have been the continuance o...
4 sales due to higher customer
regex
Pattern regex
0 Sales + customer (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1 Marketing + customer (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2 Employee * Productivity (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...
Expected output
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
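I tried the following: I wrote a function that returns the Pattern when the regex matches, and then I iterate over all the rows of the regex dataframe.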
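import re

def finding_keywords(regex, match, keyword):
    # Return the keyword when the regex matches the text, otherwise None
    if re.search(regex, match):
        return keyword
    return None

for index, row in regex.iterrows():
    # Each pass reassigns the whole 'Pattern' column, so earlier matches are lost
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(regex['regex'][index], x, regex['Pattern'][index]))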
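The problem is that each iteration overwrites the mappings from the previous ones, as shown below. Because "I'm foo foo" was the last iteration, it is the only pattern left.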
text Pattern
0 foo None
1 bar None
2 foo foo I'm foo foo
3 foo bar None
4 bar bar None
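One solution might be to iterate over the regex dataframe and then iterate over df, which would avoid losing information, but I am looking for the fastest solution.
Solution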
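You can loop over the unique values of the regex column of the regex dataframe, apply each one to the text column of df, and return the matching regex in a new regex column. Then merge in the Pattern column and drop the regex column.
The key to my approach is to first create the column as NaN and then fill it with fillna on each iteration, so that values written by earlier iterations are not overwritten.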
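As an isolated illustration of that fill-in behaviour, here is a minimal sketch. The names col, first_pass and second_pass are hypothetical and stand in for the helper column and the results of two loop iterations; they are not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: an all-missing helper column and two "iteration" results
col = pd.Series([np.nan, np.nan, np.nan], dtype=object)
first_pass = pd.Series(["a", np.nan, np.nan], dtype=object)
second_pass = pd.Series(["b", "b", np.nan], dtype=object)

# fillna only touches entries that are still missing, so "a" survives the second pass
col = col.fillna(first_pass)
col = col.fillna(second_pass)
print(col.tolist())
```

Row 0 keeps the value from the first pass, row 1 is filled by the second pass, and row 2 stays missing because nothing matched it.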
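import re
import numpy as np
import pandas as pd

# Unique regular expressions from the regex dataframe
srs = regex['regex'].unique()
# Start with an all-NaN helper column so later iterations only fill the gaps
df['regex'] = np.nan
for reg in srs:
    # Keep the regex for rows it matches; fillna leaves already-filled rows untouched
    df['regex'] = df['regex'].fillna(
        df['text'].apply(lambda x: reg if re.search(reg, x) else np.nan))
# Look up the Pattern for each matched regex, then drop the helper column
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)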
df
Out[1]:
text Pattern
0 customer and increased repair and remodel acti... Sales + customer
1 sales for the overseas customers Sales + customer
2 marketing approach is driving strong play from... Marketing + customer
3 employees in India have been the continuance o... Employee * Productivity
4 sales due to higher customer Sales + customer
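For comparison, the same "first match wins" idea can be sketched without the helper column and merge, using Series.str.contains and Series.mask. This is an alternative under the assumption that each row should receive the first pattern (in regex dataframe order) that matches; it is not the answer's method, and the variable names pattern and hits are hypothetical.

```python
import numpy as np
import pandas as pd

text_dict = {'text': ['customer and increased repair and remodel activity as well as from other sales',
                      'sales for the overseas customers',
                      'marketing approach is driving strong play from top tier customers',
                      'employees in India have been the continuance of remote work will impact productivity',
                      'sales due to higher customer']}
regex_dict = {'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
              'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                        '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                        '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}
df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)

# Hypothetical accumulator: starts all-missing, filled one regex at a time
pattern = pd.Series(np.nan, index=df.index, dtype=object)
for pat, label in zip(regex['regex'], regex['Pattern']):
    # Boolean Series: True where the regex is found anywhere in the text
    hits = df['text'].str.contains(pat, regex=True)
    # Label only rows that matched and have not been labelled by an earlier regex
    pattern = pattern.mask(hits & pattern.isna(), label)

df['Pattern'] = pattern
print(df)
```

Rows that match no regex keep a missing value in Pattern. If a row should collect every matching pattern rather than only the first, the accumulator would need to hold lists or joined strings instead.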