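python - Matching multiple words in order using regex
Question
I am trying to write a function that returns a string from a piece of text based on these conditions:
- If the string contains "recurring payment authorized on", get the first word after "on"
- If the string contains "recurring payment", get everything before it
So far I have written the following: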
# Will be used in an apply statement for a column in a DataFrame
def parser(x):
    x_list = x.split()
    if " recurring payment authorized on " in x and x_list[-1] != "on":
        return x_list[x_list.index("on") + 1]
    elif " recurring payment" in x:
        return ' '.join(x_list[:x_list.index("recurring")])
    else:
        return None
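However, this code looks clumsy and is not robust. I would like to use regex to match these strings.
Here are some examples of what this function should return: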
recurring payment authorized on usps abc
should return usps
usps recurring payment abc
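should return usps
Any help writing a regex for this function would be appreciated. The input strings will contain only text; there will be no numbers or special characters.
Answer
Use regex with lookahead and lookbehind pattern matching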
import re

def parser(x):
    # Patterns to search
    pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')
    m = pattern_on.search(x)
    if m:
        # group(1) is the captured text without the trailing whitespace
        return m.group(1)
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None

tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
for t in tests:
    found = parser(t)
    if found:
        print("In text: {}\n Found: {}".format(t, found))
Output
In text: recurring payment authorized on usps abc
Found: usps
In text: usps recurring payment abc
Found: usps
In text: recurring payment authorized on att xxx xxx
Found: att
In text: recurring payment authorized on 25.05.1980 xxx xxx
Found: 25.05.1980
In text: att recurring payment xxxxx
Found: att
In text: 12.14.14. att recurring payment xxxxx
Found: 12.14.14. att
Explanation
Regex lookbehind
(?<=foo) Lookbehind asserts that what immediately precedes the current position in the string is foo
So in the pattern: r'(?<= authorized on )(.*?)(\s+)'
foo is " authorized on "
(.*?) - matches any character (? causes it not to be greedy)
(\s+) - matches at least one whitespace
As a result, (.*?) captures all characters after " authorized on " up to the first whitespace character.
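To see what each group of this pattern holds, here is a minimal sketch that reuses the lookbehind pattern from the answer. The variable names and the `alt_on` variant are illustrative assumptions and are not part of the original answer.

```python
import re

# Lookbehind pattern as given in the answer.
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')

m = pattern_on.search("recurring payment authorized on usps abc")
whole = m.group(0)   # entire match: the word plus the whitespace matched by (\s+)
word = m.group(1)    # first capture group: only the word
print(repr(whole), repr(word))

# Because (\s+) needs at least one whitespace character, a name at the very
# end of the string is not matched by this pattern.
at_end = pattern_on.search("recurring payment authorized on usps")
print(at_end)

# Hypothetical variant (an assumption, not from the answer): \S+ takes the
# run of non-whitespace characters and needs no trailing whitespace.
alt_on = re.compile(r'(?<= authorized on )\S+')
alt_word = alt_on.search("recurring payment authorized on usps").group(0)
print(repr(alt_word))
```

The lookbehind itself is never part of the match, so only the choice between `group(0)` and `group(1)` decides whether the trailing whitespace comes back with the word.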
Regex lookahead
(?=foo) Lookahead asserts that what immediately follows the current position in the string is foo
So in the pattern: r'^(.*?)\s(?=recurring payment)'
foo is 'recurring payment'
^ - means at beginning of the string
(.*?) - matches any character (non-greedy)
\s - matches white space
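As a result, (.*?) matches all characters from the beginning of the string up to a whitespace character that is followed by "recurring payment".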
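The same kind of check can be sketched for the lookahead pattern. The variable names below are illustrative assumptions; the pattern is the one from the answer.

```python
import re

# Lookahead pattern as given in the answer.
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

text = "12.14.14. att recurring payment xxxxx"
m = pattern_recur.search(text)
whole = m.group(0)      # prefix plus the single whitespace matched by \s
prefix = m.group(1)     # only the captured prefix
rest = text[m.end():]   # the lookahead consumed nothing, so this starts at "recurring"
print(repr(whole), repr(prefix), repr(rest))

# A string that starts with "recurring payment" has no whitespace before the
# phrase, so this pattern finds nothing there.
no_prefix = pattern_recur.search("recurring payment authorized on usps abc")
print(no_prefix)
```

Since the lookahead only asserts and does not consume, "recurring payment" stays out of the match, and `group(1)` leaves out the separating whitespace as well.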
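Better performance is desirable, since you are applying this to a DataFrame that may have many columns.
Move the pattern compilation out of the parser and into the module (about 33% less time).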
# Define the patterns once at module level
pattern_on = re.compile(r'(?<= authorized on )(.*?)(\s+)')
pattern_recur = re.compile(r'^(.*?)\s(?=recurring payment)')

def parser(x):
    # Use the predefined patterns (pattern_on, pattern_recur) from globals
    m = pattern_on.search(x)
    if m:
        # group(1) is the captured text without the trailing whitespace
        return m.group(1)
    m = pattern_recur.search(x)
    if m:
        return m.group(1)
    return None

tests = ["recurring payment authorized on usps abc", "usps recurring payment abc", "recurring payment authorized on att xxx xxx", "recurring payment authorized on 25.05.1980 xxx xxx", "att recurring payment xxxxx", "12.14.14. att recurring payment xxxxx"]
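One way to check the timing difference on your own machine is a small `timeit` comparison. This is a rough sketch: the function names, the sample string, and the iteration count are assumptions, and the measured ratio will vary by machine and Python version.

```python
import re
import timeit

# Compiled once at module level.
PATTERN_ON = re.compile(r'(?<= authorized on )(.*?)(\s+)')

def parser_compile_inside(x):
    # Calls re.compile on every invocation. The re module caches compiled
    # patterns, so this is mostly a cache lookup plus call overhead.
    pattern = re.compile(r'(?<= authorized on )(.*?)(\s+)')
    m = pattern.search(x)
    return m.group(1) if m else None

def parser_precompiled(x):
    # Reuses the module-level pattern.
    m = PATTERN_ON.search(x)
    return m.group(1) if m else None

sample = "recurring payment authorized on usps abc"
t_inside = timeit.timeit(lambda: parser_compile_inside(sample), number=100_000)
t_pre = timeit.timeit(lambda: parser_precompiled(sample), number=100_000)
print("compile inside: {:.3f}s, precompiled: {:.3f}s".format(t_inside, t_pre))
```

Both functions return the same result; only the per-call overhead differs, which is what matters when the function is applied to every value in a column.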