This article is reposted from https://freeaihub.com/article/regex-module-in-python.html, where the page can also be worked through interactively online.
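This section works through examples that use Python's standard re module to introduce the regular expression support in the standard library. The re module is not fully compatible with PCRE, but it supports most common regular expression use cases.
Note that this reference applies to Python 3.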
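Raw Python strings
When writing regular expressions in Python, it is recommended that you use raw strings instead of regular Python strings. A raw string begins with a special prefix (r) and tells Python not to interpret backslashes and special metacharacters in the string, so they are passed through unchanged to the regular expression engine.
This means that a pattern like "\n\w" is not interpreted by Python first. It can be written as r"\n\w" rather than "\\n\\w" as in some other languages, which is much easier to read.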
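As a small sketch of the difference, the following compares a raw string with its escaped equivalent. The variable names and the sample text are illustrative and do not come from the original article.

```python
import re

# A raw string keeps each backslash as a literal character.
raw_pattern = r"\n\w"
# The same pattern as a regular string needs every backslash doubled.
escaped_pattern = "\\n\\w"

# Both spellings produce the same four characters: \, n, \, w
same_pattern = raw_pattern == escaped_pattern
print(same_pattern)

# In a regular string "\n" is a single newline character; in a raw
# string it stays as two characters for the regex engine to interpret.
regular_length = len("\n")
raw_length = len(r"\n")
print(regular_length, raw_length)

# The pattern matches a newline followed by one word character.
match = re.search(raw_pattern, "first line\nsecond line")
matched_text = match.group(0)
print(repr(matched_text))
```

Either spelling reaches the regex engine as the same pattern; the raw form simply avoids the doubled backslashes.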
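Matching a string
The re package has a number of top-level functions. To test whether a regular expression matches a specific string in Python, you can use re.search().
This function returns None if the pattern does not match, or a match object (re.Match) with additional information about which part of the string was matched.
Note that this function stops after the first match, so it is better suited to testing a regular expression than to extracting data.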
Method
matchObject = re.search(pattern, input_str, flags=0)
Example
import re
# Let's use a regular expression to match a date string. Ignore
# the output since we are just testing if the regex matches.
regex = r"([a-zA-Z]+) (\d+)"
if re.search(regex, "June 24"):
    # Indeed, the expression "([a-zA-Z]+) (\d+)" matches the date string
    # If we want, we can use the match object's start() and end() methods
    # to retrieve where the pattern matches in the input string, and the
    # group() method to get the full match and the captured groups.
    match = re.search(regex, "June 24")
    # This will print "Match at index 0, 7", since the match covers the
    # half-open range [0, 7), which is the whole string
    print("Match at index %s, %s" % (match.start(), match.end()))
    # The groups contain the matched values. In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1), match.group(2), ... will return the capture
    # groups in order from left to right in the pattern
    # match.group() is equivalent to match.group(0)
    # So this will print "Full match: June 24"
    print("Full match: %s" % (match.group(0)))
    # So this will print "Month: June"
    print("Month: %s" % (match.group(1)))
    # So this will print "Day: 24"
    print("Day: %s" % (match.group(2)))
else:
    # If re.search() does not match, then None is returned
    print("The regex pattern does not match. :(")
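To illustrate the point that re.search() stops after the first match, here is a minimal sketch. The input text and the four-digit pattern are assumptions chosen only for the demonstration.

```python
import re

text = "June 24, August 9, Dec 12"

# re.search() reports only the first place where the pattern matches,
# even though the text contains three dates.
first = re.search(r"([a-zA-Z]+) (\d+)", text)
first_text = first.group(0)
first_span = first.span()
print(first_text)  # June 24
print(first_span)  # (0, 7)

# A pattern that matches nowhere yields None, so check the result
# before calling methods on it.
missing = re.search(r"\d{4}", text)
print(missing is None)  # True
```

Because only the first date is reported, a different function is needed when every match in the string matters.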
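Capturing groups
Unlike the re.search() function above, re.findall() performs a global search over the whole input string. If the pattern contains capture groups, it returns a list of the captured data; otherwise it returns a list of the matches themselves. If no match is found, it returns an empty list.
If you need additional context for each match, you can use re.finditer(), which instead returns an iterator of match objects (re.Match) to walk through. Both functions take the same parameters.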
Method
matchList = re.findall(pattern, input_str, flags=0)
matchIterator = re.finditer(pattern, input_str, flags=0)
Example
import re
# Let's use a regular expression to match a few date strings.
regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will print:
    # Full match: June 24
    # Full match: August 9
    # Full match: Dec 12
    print("Full match: %s" % (match))
# To capture the specific months of each date we can use the following pattern
regex = r"([a-zA-Z]+) \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will now print:
    # Match month: June
    # Match month: August
    # Match month: Dec
    print("Match month: %s" % (match))
# If we need the exact positions of each match
regex = r"([a-zA-Z]+) \d+"
matches = re.finditer(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will now print:
    # Match at index: 0, 7
    # Match at index: 9, 17
    # Match at index: 19, 25
    # which corresponds with the start and end of each match in the input string
    print("Match at index: %s, %s" % (match.start(), match.end()))
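One case the example above does not cover is a pattern with more than one capture group. As a brief sketch, reusing the same sample dates, re.findall() then returns one tuple per match:

```python
import re

text = "June 24, August 9, Dec 12"
regex = r"([a-zA-Z]+) (\d+)"

# With two capture groups, each list item is a (month, day) tuple.
pairs = re.findall(regex, text)
print(pairs)  # [('June', '24'), ('August', '9'), ('Dec', '12')]

# finditer() yields match objects, so both the captured groups and
# the match positions are available for every match.
details = [(m.group(1), m.start()) for m in re.finditer(regex, text)]
print(details)  # [('June', 0), ('August', 9), ('Dec', 19)]
```

The list from findall() is convenient when only the captured text is needed, while finditer() is the better fit when positions or other match details matter.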
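Finding and replacing strings
Another common task is to find and replace part of a string using regular expressions, for example to replace all instances of an old email domain, or to swap the order of some text. You can do this in Python with the re.sub() function.
The optional count argument is the maximum number of replacements to make in the input string. It must be a non-negative integer; if it is omitted or zero, every match in the string is replaced.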
Method
replacedString = re.sub(pattern, replacement_pattern, input_str, count=0, flags=0)
Example
import re
# Let's try to reverse the order of the day and month in a date
# string. Notice how the replacement string also contains metacharacters
# (the back references to the captured groups) so we use a raw
# string for that as well.
regex = r"([a-zA-Z]+) (\d+)"
# This will reorder the string and print:
# 24 of June, 9 of August, 12 of Dec
print(re.sub(regex, r"\2 of \1", "June 24, August 9, Dec 12"))
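The count argument and the email-domain use case mentioned earlier can be sketched as follows. The domains old-example.com and new-example.com and the address are hypothetical values made up for this illustration.

```python
import re

regex = r"([a-zA-Z]+) (\d+)"
text = "June 24, August 9, Dec 12"

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(regex, r"\2 of \1", text, count=1)
print(first_only)  # 24 of June, August 9, Dec 12

# Omitting count (or passing 0) replaces every match.
all_dates = re.sub(regex, r"\2 of \1", text)
print(all_dates)  # 24 of June, 9 of August, 12 of Dec

# Replacing an old email domain (hypothetical domains). The dot is
# escaped so that it matches a literal "." and not any character.
address = "alice@old-example.com"
migrated = re.sub(r"@old-example\.com\b", "@new-example.com", address)
print(migrated)  # alice@new-example.com
```

Passing count by keyword keeps the call readable and avoids confusing it with the flags argument.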
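re flags
In the Python regular expression functions above, you will notice that each of them also takes an optional flags argument. Most of the available flags are a convenience and can also be written directly into the regular expression itself (for example as the inline flag (?i)), but some are useful in certain cases.
re.IGNORECASE makes the pattern case-insensitive, so that it matches strings with different capitalization.
re.MULTILINE is needed if your input string contains newline characters (\n). This flag allows the start and end metacharacters (^ and $ respectively) to match at the beginning and end of each line, instead of only at the beginning and end of the whole input string.
re.DOTALL allows the dot (.) metacharacter to match all characters, including the newline character (\n).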
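The effect of each flag is easiest to see side by side. The following is a minimal sketch; the sample strings and variable names are assumptions picked for the demonstration.

```python
import re

# re.IGNORECASE: "june" matches even though the pattern is capitalized.
ignore_case = re.search(r"June", "june 24", flags=re.IGNORECASE).group(0)
print(ignore_case)  # june

# The same flag written inline at the start of the pattern.
inline = re.search(r"(?i)June", "june 24").group(0)
print(inline)  # june

lines = "June 24\nAugust 9"

# Without re.MULTILINE, ^ matches only at the very start of the input.
single = re.findall(r"^[a-zA-Z]+", lines)
print(single)  # ['June']

# With re.MULTILINE, ^ also matches right after each newline.
multi = re.findall(r"^[a-zA-Z]+", lines, flags=re.MULTILINE)
print(multi)  # ['June', 'August']

# Without re.DOTALL, the dot stops at the newline.
no_dotall = re.search(r"June.*", lines).group(0)
print(no_dotall)  # June 24

# With re.DOTALL, the dot also matches the newline, so the match
# runs to the end of the input.
with_dotall = re.search(r"June.*", lines, flags=re.DOTALL).group(0)
print(with_dotall == lines)  # True
```

Several flags can be combined with the | operator, for example re.IGNORECASE | re.MULTILINE.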
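Compiling a pattern for performance
In Python, repeatedly creating a regular expression pattern in order to match many strings can be slow, so if you need to test or extract information from many input strings with the same expression, it is recommended that you compile it. The re module does cache recently used patterns, so the gain is most noticeable in tight loops or when many different patterns are in use. This function returns a compiled pattern object (re.Pattern).
regexObject = re.compile(pattern, flags=0)
The returned object has the same methods as the functions above, except that they take only the input string and no longer need the pattern or flags on each call.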
import re
# Let's create a pattern and extract some information with it
regex = re.compile(r"(\w+) World")
result = regex.search("Hello World is the easiest")
if result:
    # This will print:
    # 0 11
    # for the start and end of the match
    print(result.start(), result.end())
# This will print:
# Hello
# Bonjour
# for each of the captured groups that matched
for result in regex.findall("Hello World, Bonjour World"):
    print(result)
# This will substitute "World" with "Earth" and print:
# Hello Earth
print(regex.sub(r"\1 Earth", "Hello World"))
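As a further sketch of reusing one compiled pattern across many inputs, the following supplies the flags once at compile time. The name date_regex and the input list are hypothetical and exist only for this illustration.

```python
import re

# Flags are supplied once, when the pattern is compiled.
date_regex = re.compile(r"([a-z]+) (\d+)", re.IGNORECASE)

# Hypothetical inputs used only for this illustration.
inputs = ["June 24", "no date here", "AUGUST 9"]

months = []
for text in inputs:
    # No pattern or flags argument is needed on each call.
    match = date_regex.search(text)
    if match:
        months.append(match.group(1))

print(months)  # ['June', 'AUGUST']

# The original pattern string remains available on the compiled object.
print(date_regex.pattern)  # ([a-z]+) (\d+)
```

Keeping the compiled object in a module-level or otherwise long-lived variable means the pattern is built once and reused for every input.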