Beautiful Soup: remove the first occurrence after a selector

Question

I am trying to use Beautiful Soup to remove some HTML from an HTML text.

This could be my sample HTML:

<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>

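Focus on two elements: the <h2> containing "television" and the <ul> that comes after it.

I am trying to remove the first <ul> after <h2 class="myclass"><strong>television</strong></h2>. If possible, I only want to remove that <ul> when it appears within 1 or 2 elements after the <h2>.

Is that possible?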

Tags: python, beautifulsoup

Solution


You can use a CSS selector: h2:nth-of-type(2) finds the second <h2> tag. If its next_sibling, or the next_sibling after that, is a <ul> tag, remove it from the HTML with the .decompose() method.

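from bs4 import BeautifulSoup

html = """<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2><ul><li>something</li></ul><div>whatever</div><h2 class="myclass"><strong>television</strong></h2><div>whatever</div><ul><li>test</li></ul>"""
soup = BeautifulSoup(html, "html.parser")

# The second <h2> among its siblings (the "television" heading).
looking_for = soup.select_one("h2:nth-of-type(2)")

# Look at the next sibling and the one after it; either may be missing.
first = looking_for.next_sibling
second = first.next_sibling if first is not None else None

# Remove the sibling that was actually checked, not a hard-coded nth <ul>.
for sibling in (first, second):
    if getattr(sibling, "name", None) == "ul":
        sibling.decompose()
        break

print(soup.prettify())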

Output:

<p>
 whatever
</p>
<h2 class="myclass">
 <strong>
  fruit
 </strong>
</h2>
<ul>
 <li>
  something
 </li>
</ul>
<div>
 whatever
</div>
<h2 class="myclass">
 <strong>
  television
 </strong>
</h2>
<div>
 whatever
</div>
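
The snippet above locates the heading by position and counts raw siblings, so whitespace between tags would also count. As a rough sketch of a more general variant, the hypothetical helper below (remove_ul_after_heading and its max_distance parameter are names made up for illustration, not part of Beautiful Soup) finds the <h2 class="myclass"> by its text and only counts tag siblings, assuming the heading text is known in advance:

```python
from bs4 import BeautifulSoup


def remove_ul_after_heading(soup, heading_text, max_distance=2):
    """Remove the first <ul> found within `max_distance` tag siblings after
    the <h2 class="myclass"> whose text equals `heading_text`.

    Hypothetical helper for illustration. Returns True if a <ul> was removed.
    """
    for heading in soup.find_all("h2", class_="myclass"):
        if heading.get_text(strip=True) != heading_text:
            continue
        # Passing True matches tags only, so text nodes (such as whitespace
        # between elements) do not count toward the distance.
        for sibling in heading.find_next_siblings(True, limit=max_distance):
            if sibling.name == "ul":
                sibling.decompose()
                return True
        return False
    return False


html = (
    '<p>whatever</p><h2 class="myclass"><strong>fruit</strong></h2>'
    '<ul><li>something</li></ul><div>whatever</div>'
    '<h2 class="myclass"><strong>television</strong></h2>'
    '<div>whatever</div><ul><li>test</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")
removed = remove_ul_after_heading(soup, "television")
print(removed, len(soup.find_all("ul")))
```

With max_distance=1 the same call should leave the document unchanged, because a <div> sits between the heading and the <ul>.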
