Greedy and lazy quantifiers: testing with HTML tags

Problem description

The input is:

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>

The first expected output (because I am using a greedy quantifier) is:

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>

The code for the greedy case is:

text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''
pattern=re.compile(r'\<p\>.*\<\/p\>')
data1=pattern.match(text,re.MULTILINE)
print('data1:- ',data1,'\n')

The second expected output (because I am using a lazy quantifier) is:

<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>

The code for the lazy case is:

text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''
#pattern=re.compile(r'\<p\>.*?\<\/p\>')
pattern=re.compile(r'<p>.*?</p>')
data1=pattern.match(text,re.MULTILINE)
print('data1:- ',data1,'\n')

In both cases the actual output I get is None.

Tags: html, regex, python-3.x, regex-greedy

Solution


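You have several problems. First, when you call Pattern.match, the second and third arguments are the positional arguments pos and endpos, not flags. Flags must be passed to re.compile. Second, you should use re.DOTALL so that . matches newlines; re.MULTILINE only changes how ^ and $ behave. Finally, match insists that the match begin at the start of the string (a newline in your case), so it cannot match here. You probably want Pattern.search instead. This works for your sample input: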

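import re

# re.DOTALL lets '.' match newlines; search() scans past the leading newline.
pattern = re.compile(r'<p>.*</p>', re.DOTALL)
data1 = pattern.search(text)  # text is the sample string from the question
print('data1:- ', data1.group(0), '\n')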

Output:

data1:-  <p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p> 
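To see why the original calls returned None, the three issues described above can be reproduced one at a time. This is only a minimal sketch using the sample text from the question; the variable names (no_flags, via_pos, and so on) are made up for illustration.

```python
import re

# Same sample text as in the question (note the leading newline).
text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''

# Issue 1: the second argument of Pattern.match is `pos`, not flags.
# re.MULTILINE is an integer flag with value 8, so matching starts at index 8.
print(int(re.MULTILINE))  # 8
no_flags = re.compile(r'<p>.*</p>')
via_pos = no_flags.match(text, re.MULTILINE)
print(via_pos)  # None

# Issue 2: re.MULTILINE only affects ^ and $; '.' still stops at newlines,
# so a <p> ... </p> pair that spans several lines is never found.
multiline_hit = re.compile(r'<p>.*</p>', re.MULTILINE).search(text)
print(multiline_hit)  # None

# Issue 3: match() is anchored at the start position, and text[0] is '\n'.
dotall = re.compile(r'<p>.*</p>', re.DOTALL)
anchored = dotall.match(text)
print(anchored)  # None

# search() scans forward and finds the first <p> at index 1.
found = dotall.search(text)
print(found.start())  # 1

# Passing pos=1 to match() also works, which shows what that argument means.
from_pos_1 = dotall.match(text, 1)
print(from_pos_1 is not None)  # True
```

Passing the flag at compile time and switching to search fixes all three issues at once.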

To match a single paragraph, use the lazy quantifier:

# The lazy quantifier .*? stops at the first closing </p>.
pattern = re.compile(r'<p>.*?</p>', re.DOTALL)
data1 = pattern.search(text)
print('data1:- ', data1.group(0), '\n')

Output:

data1:-  <p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p> 
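The two patterns can also be compared side by side. The sketch below is not part of the original answer: it reuses the sample text and additionally calls Pattern.findall (an addition for illustration) to show that the lazy pattern yields each paragraph separately.

```python
import re

text = '''
<p>
The very <em>first</em> task is to find the beginning of a paragraph.
</p>
<p>
Then you have to find the end of the paragraph
</p>
'''

greedy = re.compile(r'<p>.*</p>', re.DOTALL)
lazy = re.compile(r'<p>.*?</p>', re.DOTALL)

greedy_match = greedy.search(text).group(0)
lazy_match = lazy.search(text).group(0)

# Greedy .* runs to the last </p>; lazy .*? stops at the first one.
print(greedy_match.count('</p>'))  # 2
print(lazy_match.count('</p>'))    # 1

# With the lazy pattern, findall returns one string per paragraph.
paragraphs = lazy.findall(text)
print(len(paragraphs))  # 2
```

For real HTML, where tags may carry attributes or be nested, an HTML parser is usually more robust than a regex; the patterns here are only meant to show the quantifier behavior.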

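Also note that /, < and > have no special meaning in a regular expression and do not need to be escaped. I have removed the escapes in the code above.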
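A quick way to check this is to run the escaped and unescaped patterns against the same input. The sample string below is a made-up one-line example, not the text from the question.

```python
import re

# Hypothetical one-line sample, chosen so no flags are needed.
sample = '<p>one</p><p>two</p>'

escaped = re.compile(r'\<p\>.*?\<\/p\>')  # pattern style from the question
plain = re.compile(r'<p>.*?</p>')         # same pattern without escapes

escaped_hits = escaped.findall(sample)
plain_hits = plain.findall(sample)

# Both patterns find exactly the same substrings.
print(escaped_hits == plain_hits)  # True
print(plain_hits)  # ['<p>one</p>', '<p>two</p>']
```

The escapes are harmless in Python, but dropping them makes the pattern easier to read.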

