BeautifulSoup: how to remove tags whose text has a specific value

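Problem description

I am trying to scrape some articles from Wikipedia and have found some entries that I would like to exclude.

In the case below, I want to exclude the two a tags whose content equals Archived or Wayback Machine. The text does not have to be the deciding factor. I see that the href value could also be used for exclusion, for example a URL containing archive.org or /wiki/Wayback_Machine.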

<li id="cite_note-22">
    <span class="mw-cite-backlink">
        <b>
            <a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
        </b>
    </span> 
    <span class="reference-text">
        <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a> 
        <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
        <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
    </span>
</li>

I tried using decompose as shown below, but it fails with the error 'str' object has no attribute 'descendants'.

removeWayback = BeautifulSoup.find_all('a', {'title':'Wayback Machine'})
removeArchive = BeautifulSoup.find(text="Archive")
removeWayback.decompose()
removeArchive.decompose()

removeWayback = BeautifulSoup.find_all('a', {'title':'Wayback Machine'})
File "/usr/local/lib/python3.8/site-packages/bs4/element.py", line 1780, in find_all generator = self.descendants
AttributeError: 'str' object has no attribute 'descendants'

I also tried using exclude, but ran into a similar problem.

Is there a better way to ignore these links?

Tags: python, beautifulsoup

Solution

You can try this:

import re
from bs4 import BeautifulSoup

html = """<li id="cite_note-22">
    <span class="mw-cite-backlink">
        <b>
            <a href="#cite_ref-22" aria-label="Jump up" title="Jump up">^</a>
        </b>
    </span> 
    <span class="reference-text">
        <a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a> 
        <a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
        <a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
    </span>
</li>"""

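soup = BeautifulSoup(html, "html.parser")
# Select only <a> tags whose text contains none of "Wayback", "Archived", or the "^" backlink marker.
for anchor in soup.find_all(lambda t: t.name == 'a' and not re.search(r'Wayback|Archived|\^', t.text)):
    print(f"{anchor.text} - {anchor.get('href')}")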

Output:

Article Text I want to keep - https://www.somelink.com
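
The snippet above only prints the links that pass the filter; it does not change the tree. If the goal is to actually remove the unwanted tags, as the question attempts with decompose, a minimal sketch could look like the following. It rests on two observations about the code in the question: find_all appears to be called on the BeautifulSoup class rather than on a parsed instance (so the string 'a' ends up in the place of self, which would explain the 'descendants' error), and find_all returns a list-like ResultSet, so decompose has to be called on each tag rather than on the result as a whole. The shortened html string and the exact-match pattern are assumptions made for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened version of the sample markup from the question (assumption: only
# the reference-text span matters for this illustration).
html = """<span class="reference-text">
<a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
<a rel="nofollow" class="external text" href="https://www.someotherlink.com">Archived</a>
<a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

# Parse first, then search on the resulting instance, not on the class.
soup = BeautifulSoup(html, "html.parser")

# Match anchors whose whole text is exactly "Archived" or "Wayback Machine".
unwanted = re.compile(r"^(Archived|Wayback Machine)$")

# find_all returns a list of tags, so decompose each one individually.
for anchor in soup.find_all("a", string=unwanted):
    anchor.decompose()

remaining = [a.get_text() for a in soup.find_all("a")]
print(remaining)
```

After the loop, the two unwanted anchors are gone from soup, so any later find_all or get_text call on the same tree no longer sees them.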

Edit, in response to a comment:

You can match by class using the attrs= argument of .find_all() and put the regex condition on the text inside the loop.

soup = BeautifulSoup(html, "html.parser")
# Restrict the search to anchors with class "external text", then filter on their text.
for anchor in soup.find_all("a", attrs={"class": "external text"}):
    if not re.search(r'Wayback|Archived', anchor.text):
        print(f"{anchor.text} - {anchor.get('href')}")

Output:

Article Text I want to keep - https://www.somelink.com
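
The question also mentions that the href value could serve as the exclusion criterion instead of the link text. A rough sketch of that idea follows. Note the assumptions: in the sample from the question the "Archived" link points to the placeholder https://www.someotherlink.com, so the archive.org URL used here is hypothetical, as are the names BLOCKED_HREF_PARTS and is_blocked.

```python
from bs4 import BeautifulSoup

# Sample markup adapted from the question. The archive.org href on the
# "Archived" link is a hypothetical stand-in for the placeholder URL.
html = """<span class="reference-text">
<a rel="nofollow" class="external text" href="https://www.somelink.com">Article Text I want to keep</a>
<a rel="nofollow" class="external text" href="https://web.archive.org/web/2020/https://www.somelink.com">Archived</a>
<a href="/wiki/Wayback_Machine" title="Wayback Machine">Wayback Machine</a>
</span>"""

soup = BeautifulSoup(html, "html.parser")

# Hypothetical blocklist of substrings that mark a link as unwanted.
BLOCKED_HREF_PARTS = ("archive.org", "/wiki/Wayback_Machine")


def is_blocked(tag):
    """Return True for <a> tags whose href contains a blocked substring."""
    href = tag.get("href", "")
    return tag.name == "a" and any(part in href for part in BLOCKED_HREF_PARTS)


# Remove every blocked anchor from the tree.
for anchor in soup.find_all(is_blocked):
    anchor.decompose()

kept = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
print(kept)
```

Filtering on href avoids depending on the visible link text, which may differ between articles or languages, but it only works if the unwanted links really do share a recognizable URL pattern.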
