Pandas: mapping two data frames using regular expressions

Problem description

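I have two data frames: one with text, and another with regular expressions and pattern labels. I need to map a column from the second data frame onto the first by using the regular expressions.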

Edit: what I need is to apply every regex to all rows of df['text'] and, if there is a match, add the Pattern to a new column.

Sample data

import pandas as pd

text_dict = {'text':['customer and increased repair and remodel activity as well as from other sales',
             'sales for the overseas customers',
             'marketing approach is driving strong play from top tier customers',
             'employees in India have been the continuance of remote work will impact productivity',
             'sales due to higher customer']}

regex_dict = {'Pattern':['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
             'regex': ['(?:sales\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:sales\\w*)',
                       '(?:marketing\\w*)(?:[^,.?])*(?:customer\\w*)|(?:customer\\w*)(?:[^,.?])*(?:marketing\\w*)',
                       '(?:employee\\w*)(?:[^\n])*(?:productivity\\w*)|(?:productivity\\w*)(?:[^\n])*(?:employee\\w*)']}

# Build the two data frames shown below
df = pd.DataFrame(text_dict)
regex = pd.DataFrame(regex_dict)

df

                                                text
0  customer and increased repair and remodel acti...
1                   sales for the overseas customers
2  marketing approach is driving strong play from...
3  employees in India have been the continuance o...
4                       sales due to higher customer

regex

                   Pattern                                              regex
0         Sales + customer  (?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:cust...
1     Marketing + customer  (?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:...
2  Employee * Productivity  (?:employee\w*)(?:[^\n])*(?:productivity\w*)|(...

Expected output

                                                text                  Pattern
0  customer and increased repair and remodel acti...         Sales + customer
1                   sales for the overseas customers         Sales + customer
2  marketing approach is driving strong play from...     Marketing + customer
3  employees in India have been the continuance o...  Employee * Productivity
4                       sales due to higher customer         Sales + customer

I tried the following: I wrote a function that returns the Pattern when there is a match, then iterated over all the rows of the regex data frame.

import re

def finding_keywords(regex, match, keyword):
    # Return the keyword if the regex matches the text; otherwise None
    if re.search(regex, match):
        return keyword

for index, row in regex.iterrows():
    # Each iteration reassigns df['Pattern'], overwriting earlier matches
    df['Pattern'] = df['text'].apply(lambda x: finding_keywords(row['regex'], x, row['Pattern']))

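The problem is that each iteration erases the previous mapping, as shown below. Because "I'm foo foo" was the pattern in the last iteration, it is the only one that remains.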

      text      Pattern
0      foo         None
1      bar         None
2  foo foo  I'm foo foo
3  foo bar         None
4  bar bar         None

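One solution might be to iterate over the regex data frame and then iterate over df, which would avoid losing information, but I am looking for the fastest solution.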

Tags: python, regex, pandas

Solution


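You can loop over the unique values of the regex column in the regex data frame, apply each one to the text column of df, and store the matching regex in a new regex column on df. Then merge in the Pattern column and drop the regex column.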

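The key to my approach is to create the column as NaN first and then call fillna on each iteration, so that values already in the column are not overwritten.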
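
To see why filling does not overwrite earlier matches, here is a minimal sketch of Series.fillna with another Series as the fill value. The toy values ("first", "second") are made up for illustration and are not part of the question's data.

```python
import numpy as np
import pandas as pd

# Hypothetical helper column: row 1 already holds a value from an earlier iteration.
current = pd.Series([np.nan, "first", np.nan], dtype=object)
# Hypothetical result of a later iteration: it matched rows 0 and 1.
later = pd.Series(["second", "second", np.nan], dtype=object)

# fillna only touches positions that are still missing, so "first" survives.
filled = current.fillna(later)
print(filled.tolist())
```

Row 1 keeps its earlier value, row 0 takes the later one, and row 2 stays missing. The answer's code applies the same idea with one regex per iteration.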

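import re
import numpy as np
import pandas as pd

# Unique regex strings from the regex data frame
srs = regex['regex'].unique()
# Start with an all-NaN helper column so each pass only fills the gaps
df['regex'] = np.nan

for reg in srs:
    # Keep earlier matches; fill only the rows that are still NaN
    df['regex'] = df['regex'].fillna(df['text'].apply(lambda x: reg
                               if re.search(reg, x) else np.nan))

# Look up the Pattern for each matched regex, then drop the helper column
df = pd.merge(df, regex, how='left', on='regex').drop('regex', axis=1)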

df

Out[1]: 
                                                text                  Pattern
0  customer and increased repair and remodel acti...         Sales + customer
1                   sales for the overseas customers         Sales + customer
2  marketing approach is driving strong play from...     Marketing + customer
3  employees in India have been the continuance o...  Employee * Productivity
4                       sales due to higher customer         Sales + customer
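
Since the question asks for speed, one possible alternative (not from the original answer) is to build a boolean mask per regex with Series.str.contains and let numpy.select pick the first matching Pattern. This is only a sketch under the assumption that the first match per row is what is wanted; it rebuilds the sample data with raw strings so that it runs on its own, and no timing has been measured here.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the regexes written as raw strings.
df = pd.DataFrame({'text': [
    'customer and increased repair and remodel activity as well as from other sales',
    'sales for the overseas customers',
    'marketing approach is driving strong play from top tier customers',
    'employees in India have been the continuance of remote work will impact productivity',
    'sales due to higher customer']})

regex = pd.DataFrame({
    'Pattern': ['Sales + customer', 'Marketing + customer', 'Employee * Productivity'],
    'regex': [
        r'(?:sales\w*)(?:[^,.?])*(?:customer\w*)|(?:customer\w*)(?:[^,.?])*(?:sales\w*)',
        r'(?:marketing\w*)(?:[^,.?])*(?:customer\w*)|(?:customer\w*)(?:[^,.?])*(?:marketing\w*)',
        r'(?:employee\w*)(?:[^\n])*(?:productivity\w*)|(?:productivity\w*)(?:[^\n])*(?:employee\w*)']})

# One boolean mask per regex; str.contains searches each row for the pattern.
conditions = [df['text'].str.contains(pat, regex=True).to_numpy()
              for pat in regex['regex']]

# np.select takes the label of the first condition that is True in each row;
# rows that match nothing get None.
df['Pattern'] = np.select(conditions, regex['Pattern'].tolist(), default=None)

print(df)
```

On the sample data this gives the same Pattern column as the expected output. If a row can match several regexes, the order of the rows in the regex data frame decides which label wins, just as in the fillna loop.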
