How to use regex findall to find multiple patterns?

Question

My task is to get the "li" and "ul" tags from a string and count them. This is what I tried; it works, but I am looking for a better solution.

String

<ul><li>Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li>Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li>Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li>Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>,</ul>

Code:

import re

# String holds the HTML shown above
liTag = re.findall('<li>', String)
ulTag = re.findall('<ul>', String)
count = len(liTag) + len(ulTag)

Tags: python, regex, string

Solution

For your example, re is a perfectly good solution and you don't need to look for another method.

At most, you could write it as

tags = re.findall('<(li|ul)>', html)
print(len(tags))
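
If the per-tag breakdown is also of interest, the result of that single findall call can be fed into a counter. This is only a minimal sketch: the short `html` string is an illustrative stand-in for the string in the question, and the name `counts` is made up for the example.

```python
import re
from collections import Counter

# Short stand-in for the HTML string from the question (illustrative only).
html = '<ul><li>one</li><li>two</li><li>three</li></ul>'

# With exactly one capture group, findall returns the captured tag names.
tags = re.findall(r'<(li|ul)>', html)

# Counter groups the captured names so each tag gets its own count.
counts = Counter(tags)

print(len(tags))  # total number of opening <ul> and <li> tags
print(counts)     # per-tag breakdown
```

Because the pattern contains one group, findall yields the tag names rather than the full matches, which is what makes the counter useful here.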

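But if you get more complex tags, such as <ul class="..."> (or something more complicated), that regex will no longer match, and it is better (and easier) to use lxml, BeautifulSoup, or another HTML parser.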
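
To see why, here is a small sketch of the exact-match pattern applied to tags that carry attributes. The `html` snippet and the name `strict` are hypothetical and only serve the illustration.

```python
import re

# Hypothetical snippet in which some tags carry attributes.
html = '<ul class="tips"><li id="a">one</li><li>two</li></ul>'

# The pattern only matches a tag name immediately followed by '>'.
strict = re.findall(r'<(li|ul)>', html)

# Only the bare <li> is found; <ul class=...> and <li id=...> are skipped.
print(strict)
```

Three opening tags are present, but the pattern reports only the one without attributes.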

lxml:

import lxml.html

html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''

soup = lxml.html.fromstring(html)

li_tags = soup.xpath('//li')
ul_tags = soup.xpath('//ul')
count = len(li_tags) + len(ul_tags)

print(count)

You can even try

tags = soup.xpath('//ul|//li')
print(len(tags))

BeautifulSoup:

from bs4 import BeautifulSoup

html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''

soup = BeautifulSoup(html, 'html.parser')

li_tags = soup.find_all('li')
ul_tags = soup.find_all('ul')
count = len(li_tags) + len(ul_tags)

print(count)

You can even do

tags = soup.find_all(('ul', 'li'))
print(len(tags))

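EDIT: In html I added extra attributes to the <ul> and <li> tags (id, class, name, style, data) and the code still works without any changes.

With regex, the pattern would need to become '<(li|ul).*>' or '<(ul|li)'. For anything more complex, it would need correspondingly more complex changes.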
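
The two regex patterns mentioned here can behave differently on less tidy input. The sketch below compares them on a hypothetical single-line string that also contains a <link> tag. The third pattern, using `\b` and `[^>]*`, is not from the answer; it is one possible tightening, offered as an assumption and not as a general HTML solution.

```python
import re

# Hypothetical one-line input; the <link> tag is there to expose false matches.
html = '<ul id="list"><li class="first">one</li><li>two</li><link rel="x"></ul>'

# '.*' is greedy, so on a single line it runs to the last '>' and
# swallows every later tag into the first match.
greedy = re.findall(r'<(li|ul).*>', html)

# A bare prefix match also accepts any tag whose name starts with 'li',
# such as <link>.
loose = re.findall(r'<(ul|li)', html)

# Assumed alternative: \b ends the tag name, [^>]* allows attributes
# but stops at the first '>'.
tight = re.findall(r'<(ul|li)\b[^>]*>', html)

print(greedy)
print(loose)
print(tight)
```

On the multi-line example from the answer, each tag sits on its own line, so the greedy pattern still counts correctly there. The differences only show up once several tags share a line or similarly named tags appear.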

