Using Regular Expressions in Python 3 (the re module)

freeaihub 2020-06-17 09:00

This article is reposted from https://freeaihub.com/article/regex-module-in-python.html, where it can be worked through interactively online.

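This section works through examples with re, Python's standard regular-expression module, to cover the regex features the standard library offers. The re module is not fully compatible with PCRE, but it supports most common regular-expression use cases.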

Note that this reference applies to Python 3.

Raw Python strings

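When writing regular expressions in Python, use raw strings rather than regular Python strings. A raw string starts with a special prefix (r), which tells Python not to process backslash escape sequences in the string, so the backslashes and metacharacters are passed to the regular-expression engine exactly as typed.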

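This means that a pattern such as "\n\w" is not interpreted by Python and can be written as r"\n\w" instead of "\\n\\w" as in other languages, which is much easier to read.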
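As a small illustration of the point above, the following sketch compares a regular string with doubled backslashes against the equivalent raw string. The sample pattern and the input text "version 3.14" are made up for this example and are not from the original article.

```python
import re

# A regular string needs doubled backslashes to spell the regex \d+\.\d+ ...
escaped = "\\d+\\.\\d+"
# ... while a raw string carries the same characters exactly as typed.
raw = r"\d+\.\d+"

# Both literals produce the identical pattern text.
same_pattern = escaped == raw

# In a regular string "\n" is one newline character; in a raw string
# r"\n" is two characters (a backslash followed by the letter n).
lengths = (len("\n"), len(r"\n"))

# Hypothetical input, used only to show the raw pattern in action.
version = re.search(raw, "version 3.14").group(0)

print(same_pattern, lengths, version)
```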

Matching a string

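The re package has a number of top-level functions. To test whether a regular expression matches a specific string in Python, you can use re.search(). This function returns None if the pattern does not match; otherwise it returns a match object (re.Match in recent Python 3 versions, called MatchObject in older documentation) with additional information about which part of the string matched.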

Note that this function stops after the first match, so it is best suited to testing a regular expression rather than extracting data.

Method

matchObject = re.search(pattern, input_str, flags=0) 

import re
# Let's use a regular expression to match a date string. Ignore
# the output since we are just testing if the regex matches.
regex = r"([a-zA-Z]+) (\d+)"
if re.search(regex, "June 24"):
    # Indeed, the expression "([a-zA-Z]+) (\d+)" matches the date string
    
    # If we want, we can use the match object's start() and end() methods
    # to retrieve where the pattern matches in the input string, and the
    # group() method to get the full match and the captured groups.
    match = re.search(regex, "June 24")
    
    # This will print "Match at index 0, 7": the match covers the half-open
    # range [0, 7), which is the whole input string
    print("Match at index %s, %s" % (match.start(), match.end()))
    
    # The groups contain the matched values.  In particular:
    #    match.group(0) always returns the fully matched string
    #    match.group(1), match.group(2), ... will return the capture
    #            groups in order from left to right in the input string
    #    match.group() is equivalent to match.group(0)
    
    # So this will print "June 24"
    print("Full match: %s" % (match.group(0)))
    # So this will print "June"
    print("Month: %s" % (match.group(1)))
    # So this will print "24"
    print("Day: %s" % (match.group(2)))
else:
    # If re.search() does not match, then None is returned
    print("The regex pattern does not match. :(")
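To make the "stops after the first match" behaviour concrete, here is a minimal sketch. The three-date input string and the four-digit pattern are assumptions chosen for illustration.

```python
import re

text = "June 24, August 9, Dec 12"

# re.search() reports only the first match, even though three dates are present.
first = re.search(r"([a-zA-Z]+) (\d+)", text)
first_match = first.group(0) if first else None

# When nothing matches (there is no run of four digits here), the result
# is None, so check it before calling group().
missing = re.search(r"\d{4}", text)

print(first_match, missing)
```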

Capturing groups

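Unlike re.search() above, re.findall() performs a global search over the whole input string. If the pattern contains capturing groups, it returns a list of the captured data (a list of strings for a single group, a list of tuples for several groups); otherwise it returns a list of the matches themselves. If no match is found, it returns an empty list.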

If you need additional context for each match, you can use re.finditer(), which returns an iterator of match objects. Both functions take the same parameters.

Method

matchList = re.findall(pattern, input_str, flags=0) 
matchIterator = re.finditer(pattern, input_str, flags=0) 

import re
# Let's use a regular expression to match a few date strings.
regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will print:
    #   June 24
    #   August 9
    #   Dec 12
    print("Full match: %s" % (match))

# To capture the specific months of each date we can use the following pattern
regex = r"([a-zA-Z]+) \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will now print:
    #   June
    #   August
    #   Dec
    print("Match month: %s" % (match))

# If we need the exact positions of each match
regex = r"([a-zA-Z]+) \d+"
matches = re.finditer(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will now print:
    #   0 7
    #   9 17
    #   19 25
    # which corresponds with the start and end of each match in the input string
    print("Match at index: %s, %s" % (match.start(), match.end()))
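The examples above use at most one capturing group. As a hedged sketch of the multi-group case, the code below reuses the same date string with a two-group pattern; the variable names pairs and details are introduced here for illustration only.

```python
import re

text = "June 24, August 9, Dec 12"
regex = r"([a-zA-Z]+) (\d+)"

# With two capturing groups, each list item is a (month, day) tuple of strings.
pairs = re.findall(regex, text)

# finditer() yields match objects, so the groups and the position of each
# match are available together.
details = [(m.group(1), int(m.group(2)), m.start())
           for m in re.finditer(regex, text)]

print(pairs)
print(details)
```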

Finding and replacing strings

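Another common task is to find and replace part of a string using regular expressions, for example to replace every instance of an old email domain, or to swap the order of some text. In Python you can do this with re.sub().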

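The optional count argument is the maximum number of replacements to make in the input string. If it is omitted or zero, every match in the string is replaced.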

Method

replacedString = re.sub(pattern, replacement_pattern, input_str, count=0, flags=0) 

import re
# Let's try and reverse the order of the day and month in a date 
# string. Notice how the replacement string also contains metacharacters
# (the back references to the captured groups) so we use a raw 
# string for that as well.
regex = r"([a-zA-Z]+) (\d+)"

# This will reorder the string and print:
#   24 of June, 9 of August, 12 of Dec
print(re.sub(regex, r"\2 of \1", "June 24, August 9, Dec 12"))
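The count argument and the email-domain use case mentioned earlier are not exercised by the example above, so here is a short sketch of both. The addresses and the domains old.example.com and new.example.com are hypothetical placeholders.

```python
import re

regex = r"([a-zA-Z]+) (\d+)"
text = "June 24, August 9, Dec 12"

# count=1 limits the substitution to the first match only.
first_only = re.sub(regex, r"\2 of \1", text, count=1)

# count=0 (the default) replaces every match.
all_matches = re.sub(regex, r"\2 of \1", text, count=0)

# Hypothetical addresses: swap an old email domain for a new one.
emails = "alice@old.example.com, bob@old.example.com"
migrated = re.sub(r"@old\.example\.com\b", "@new.example.com", emails)

print(first_only)
print(all_matches)
print(migrated)
```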

re flags

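You will have noticed that each of the Python regular-expression functions above takes an optional flags argument. Most of the available flags are a convenience and could be written directly into the regular expression itself, but some are useful in certain cases.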

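  • re.IGNORECASE makes the pattern case-insensitive, so that it matches strings of different capitalization
  • re.MULTILINE is needed when the input string contains newlines (\n); this flag lets the start and end metacharacters (^ and $ respectively) match at the beginning and end of each line rather than at the beginning and end of the whole input string
  • re.DOTALL lets the dot (.) metacharacter match every character, including the newline (\n)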
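The three flags can be compared side by side with a small sketch like the one below. The three-line sample text is invented for this example; the inline form (?i) is the in-pattern spelling of re.IGNORECASE.

```python
import re

text = "first line\nSecond LINE\nthird line"

# re.IGNORECASE: "line" also matches "LINE".
ignorecase = re.findall(r"line", text, flags=re.IGNORECASE)

# Without re.MULTILINE, ^ matches only at the start of the whole string.
starts_default = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ matches at the start of every line.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without re.DOTALL, the dot stops at a newline, so this finds nothing.
dot_default = re.search(r"first.*third", text)
# With re.DOTALL, the dot also matches "\n", so the match can span lines.
dot_all = re.search(r"first.*third", text, flags=re.DOTALL)

# The same case-insensitive search with the flag written into the pattern.
inline = re.findall(r"(?i)line", text)

print(ignorecase, starts_default, starts_multiline)
print(dot_default, dot_all.group(0) if dot_all else None, inline)
```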

Compiling a pattern for performance

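In Python, creating a new regular-expression pattern to match many strings can be slow, so if you need to test many input strings against the same expression, or extract information from them, it is recommended that you compile it. re.compile() returns a compiled pattern object (re.Pattern in recent Python 3 versions, called RegexObject in older documentation).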

regexObject = re.compile(pattern, flags=0) 

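The returned object has exactly the same methods as above, except that they take the input string directly and no longer need the pattern or flags on each call.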

import re
# Let's create a pattern and extract some information with it
regex = re.compile(r"(\w+) World")
result = regex.search("Hello World is the easiest")
if result:
    # for the start and end of the match
    print(result.start(), result.end())

# for each of the captured groups that matched
for result in regex.findall("Hello World, Bonjour World"):
    print(result)

# This will substitute "World" with "Earth" and print "Hello Earth"
print(regex.sub(r"\1 Earth", "Hello World"))
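To show flags being supplied once at compile time and the pattern being reused over many inputs, here is a minimal sketch. The names date_regex, inputs, and months, as well as the sample lines, are assumptions made for this example.

```python
import re

# Flags are given once, when the pattern is compiled.
date_regex = re.compile(r"([a-z]+) (\d+)", flags=re.IGNORECASE)

# Hypothetical inputs to run the same compiled pattern against.
inputs = ["June 24", "no date here", "DEC 12"]

months = []
for line in inputs:
    # No pattern or flags argument is needed on each call.
    match = date_regex.search(line)
    if match:
        months.append(match.group(1))

print(months)
```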
