Matching multiple words in order using regular expressions

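Problem description

I am trying to create a function that returns a string from a text based on these conditions:

  1. If the string contains "recurring payment authorized on", get the first word after "on"
  2. If the string contains "recurring payment", get everything before it

So far I have written the following: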

# will be used in an apply statement for a column in a dataframe
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on") + 1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None

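However, this code looks clumsy and is not robust. I would like to use regular expressions to match these strings.

Here are some examples of what this function should return:

  1. recurring payment authorized on usps abc should return usps

  2. usps recurring payment abc should return usps

Any help with writing a regular expression for this function would be appreciated. The input strings will contain only text; there will be no digits or special characters.

Tags: python, regex, pandas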

Solution


Use regular expressions with lookahead and lookbehind assertions

import re

def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

    # Case 1: the first word after "authorized on"
    m = pattern_on.search(x)
    if m:
        return m.group(0)

    # Case 2: everything before "recurring payment"
    m = pattern_recur.search(x)
    if m:
        return m.group(0)

    return None

tests =  ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]


for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))

Output

In text: recurring payment authorized on usps abc
 Found: usps 
In text: usps recurring payment abc
 Found: usps 
In text: recurring payment authorized on att xxx xxx
 Found: att 
In text: recurring payment authorized on 25.05.1980 xxx xxx
 Found: 25.05.1980 
In text: att recurring payment xxxxx
 Found: att 
In text: 12.14.14. att recurring payment xxxxx
 Found: 12.14.14. att 
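
Note that every "Found:" value in the output above ends with a space, because group(0) is the whole match and the whole match includes the whitespace consumed by (\s+) or \s. The question asks for usps without trailing whitespace, so a minimal sketch of one possible adjustment is to return group(1) instead. The name parser_stripped is hypothetical and not part of the original answer:

```python
import re

# Same patterns as in the answer above, compiled once.
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

def parser_stripped(x):
    """Hypothetical variant of parser() that returns only capture group 1."""
    for pattern in (pattern_on, pattern_recur):
        m = pattern.search(x)
        if m:
            # group(1) is what (.*?) captured, without the trailing whitespace
            return m.group(1)
    return None

print(repr(parser_stripped("recurring payment authorized on usps abc")))
print(repr(parser_stripped("usps recurring payment abc")))
print(repr(parser_stripped("12.14.14. att recurring payment xxxxx")))
```

As with the original pattern, pattern_on still needs at least one whitespace character after the captured word, so a string that ends right after that word will not match it.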

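Explanation

Lookahead and lookbehind pattern matching

Regex lookbehind

(?<=foo) Lookbehind asserts that what immediately precedes the current position in the string is foo

So in the pattern: r'(?<= authorized on )(.*?)(\s+)'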

foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace

As a result, (.*?) captures all characters after " authorized on " up to the first whitespace character.
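
To make the lookbehind behaviour concrete, here is a small sketch that inspects the match object for one of the test strings. It only uses the pattern from the answer; the variable names are chosen for illustration:

```python
import re

text = "recurring payment authorized on usps abc"
m = re.search(r'(?<= authorized on )(.*?)(\s+)', text)

# The lookbehind consumes nothing, so the match starts right after "on ".
print(repr(text[:m.start()]))  # the text before the match
print(repr(m.group(0)))        # whole match: the word plus the whitespace from (\s+)
print(repr(m.group(1)))        # only what (.*?) captured
print(repr(m.group(2)))        # the whitespace captured by (\s+)
```

The text matched by the lookbehind is not part of group(0); it only constrains where the match may begin.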

Regex lookahead

(?=foo) Lookahead asserts that what immediately follows the current position in the string is foo

So: r'^(.*?)\s(?=recurring payment)'

foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space

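As a result, (.*?) matches all characters from the beginning of the string up to a whitespace character that is followed by "recurring payment".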
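
The lookahead can be examined the same way. This sketch, again using only the pattern from the answer, shows that the match stops before "recurring payment" and that the non-greedy group can still span several words:

```python
import re

pattern = re.compile(r'^(.*?)\s(?=recurring payment)')

text = "usps recurring payment abc"
m = pattern.search(text)
print(repr(m.group(0)))      # the prefix plus the single whitespace matched by \s
print(repr(m.group(1)))      # the prefix alone
print(repr(text[m.end():]))  # the lookahead text is left unconsumed

# A multi-word prefix is captured as a whole.
m2 = pattern.search("12.14.14. att recurring payment xxxxx")
print(repr(m2.group(1)))
```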

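Better performance is desirable, because you are applying this to a DataFrame that may have many columns.

Move the pattern compilation out of the parser and up to module level (this reduced the run time by 33%).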

import re

def parser(x):
    # Use predefined patterns (pattern_on, pattern_recur) from module globals
    m = pattern_on.search(x)
    if m:
        return m.group(0)

    m = pattern_recur.search(x)
    if m:
        return m.group(0)

    return None

# Define the patterns once, at module level
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

tests =  ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
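
The timing claim can be checked with a rough sketch like the one below, which uses the standard timeit module. The function names and the iteration count are assumptions made for illustration, and the measured difference depends on the machine and Python version, so the 33% figure may not reproduce exactly:

```python
import re
import timeit

PATTERN_ON = r'(?<= authorized on )(.*?)(\s+)'
PATTERN_RECUR = r'^(.*?)\s(?=recurring payment)'

def parser_compile_inside(x):
    # Calls re.compile on every invocation (served from re's internal cache)
    pattern_on = re.compile(PATTERN_ON)
    pattern_recur = re.compile(PATTERN_RECUR)
    m = pattern_on.search(x) or pattern_recur.search(x)
    return m.group(0) if m else None

# Compiled once when the module is loaded
_ON = re.compile(PATTERN_ON)
_RECUR = re.compile(PATTERN_RECUR)

def parser_module_level(x):
    m = _ON.search(x) or _RECUR.search(x)
    return m.group(0) if m else None

tests = [
    "recurring payment authorized on usps abc",
    "usps recurring payment abc",
    "recurring payment authorized on att xxx xxx",
    "recurring payment authorized on 25.05.1980 xxx xxx",
    "att recurring payment xxxxx",
    "12.14.14. att recurring payment xxxxx",
]

for func in (parser_compile_inside, parser_module_level):
    seconds = timeit.timeit(lambda: [func(t) for t in tests], number=10_000)
    print("{}: {:.3f} s".format(func.__name__, seconds))
```

Both functions return the same values; only the per-call overhead of looking up the compiled pattern differs.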
