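python - How can I find multiple patterns with regex findall?
Question
My task is to find the "li" and "ul" tags in a string and count them. Here is what I tried; it works, but I am looking for a better solution.
String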
<ul><li>Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li>Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li>Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li>Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>,</ul>
Code:
import re
# String holds the HTML shown above
liTag = re.findall('<li>', String)
ulTag = re.findall('<ul>', String)
count = len(liTag) + len(ulTag)
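Answer
For your example, re is a perfectly good solution, and you don't need to look for another approach.
At most, you could shorten it to a single call (here html holds the string shown above):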
tags = re.findall('<(li|ul)>', html)
print(len(tags))
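Because the pattern has exactly one capturing group, findall returns the captured tag names rather than the full matches, so the same list can also give per-tag counts. A minimal sketch, assuming a shortened sample string (the html value and the per_tag name are illustrative, not from the question):

```python
import re
from collections import Counter

# Hypothetical shortened sample with the same shape as the question's string.
html = "<ul><li>Wash your hands.</li>\n<li>Practice social distancing.</li></ul>"

# With one capturing group, findall returns only the captured names.
tags = re.findall(r"<(li|ul)>", html)

# Counter groups the names, giving a count per tag as well as the total.
per_tag = Counter(tags)

print(tags)                          # list of captured tag names
print(len(tags))                     # total number of <li> and <ul> tags
print(per_tag["li"], per_tag["ul"])  # count per tag
```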
But once the tags get more complex, for example <ul class="..."> (or worse), this regex no longer matches them, and it is better (and easier) to use lxml, BeautifulSoup, or another HTML parser.
lxml:
import lxml.html
html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''
soup = lxml.html.fromstring(html)
li_tags = soup.xpath('//li')
ul_tags = soup.xpath('//ul')
count = len(li_tags) + len(ul_tags)
print(count)
You can even select both in a single XPath expression:
tags = soup.xpath('//ul|//li')
print(len(tags))
BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<ul id="list">
<li class="first">Regularly wash your hands for 20 seconds or use a hand sanitizer with at least 60 percent alcohol. Pay attention to hand hygiene, especially when you’ve been in a public place and after coughing, sneezing, or blowing your nose.</li>
<li name="second">Practice <a href="https://www.answers.com/Q/What_is_social_distancing" rel="nofollow ugc">social distancing</a> by increasing the space between you and other people. That means staying home as much as you can, especially if you feel sick.</li>
<li style="color:red">Disinfect frequently touched surfaces (like keyboards, doorknobs, and light switches) every day.</li>
<li data="last">Cover coughs and sneezes with the inside of your elbow or a tissue. Throw the tissue away immediately and wash your hands.</li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')
li_tags = soup.find_all('li')
ul_tags = soup.find_all('ul')
count = len(li_tags) + len(ul_tags)
print(count)
You can even search for both names in one call:
tags = soup.find_all(('ul', 'li'))
print(len(tags))
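If installing a third-party library is not an option, Python's built-in html.parser module can do the same count. This is only a sketch under assumptions: the TagCounter class and the shortened html sample are illustrative names and data, not part of the original answer.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags whose names are in `wanted` (hypothetical helper)."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # The parser passes the tag name in lowercase, separate from its
        # attributes, so extra attributes do not affect the comparison.
        if tag in self.wanted:
            self.counts[tag] += 1


# Hypothetical shortened sample with attributes on every tag.
html = '''<ul id="list">
<li class="first">Wash your hands.</li>
<li name="second">Practice <a href="#">social distancing</a>.</li>
</ul>'''

parser = TagCounter(["ul", "li"])
parser.feed(html)
parser.close()

print(sum(parser.counts.values()))               # total of <ul> and <li>
print(parser.counts["li"], parser.counts["ul"])  # count per tag
```

It is more code than the lxml or BeautifulSoup versions, but it has no dependencies outside the standard library.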
EDIT: In html I added extra attributes to every <ul> and <li> tag (id, class, name, style, data), and the parser code still works without any changes.
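The regex, on the other hand, would have to become '<(li|ul).*>' or '<(ul|li)'. Both are fragile: the greedy .* swallows any further tags on the same line, and '<(ul|li)' also matches the start of other tags such as <link>. Anything more complex requires ever more complex changes to the pattern.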
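To make the difference between these patterns concrete, the sketch below runs them on a small sample and compares them with a slightly stricter variant, '<(ul|li)\b[^>]*>'. The stricter pattern is my own suggestion rather than part of the original answer, and the html sample is made up for illustration. It still assumes no attribute value contains a '>' character, so it is a stopgap and not a substitute for a parser.

```python
import re

# Hypothetical sample: attributes on the tags, <ul> and <li> on the same
# line, and a <link> tag whose name starts with "li".
html = (
    '<link rel="stylesheet"><ul id="list"><li class="first">One</li>\n'
    '<li data="last">Two</li></ul>'
)

# Greedy: each match runs to the last '>' on its line, hiding other tags.
greedy = re.findall(r"<(li|ul).*>", html)

# Prefix only: also matches the "<li" at the start of "<link".
prefix = re.findall(r"<(ul|li)", html)

# Stricter (assumed variant): \b requires the tag name to end here, and
# [^>]* stops at the first '>' so each match covers exactly one opening tag.
bounded = re.findall(r"<(ul|li)\b[^>]*>", html)

print(len(greedy), len(prefix), len(bounded))
```

On this sample only the stricter pattern reports the three real tags (one ul and two li); the greedy one undercounts and the prefix one overcounts.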