Extracting text from a span tag without a class in BeautifulSoup

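Question

I am trying to extract data from a website for a small data analysis project. This is how I fetch the page, followed by the HTML source I am working with (all the divs I want to extract data from have exactly the same structure).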

import requests
from bs4 import BeautifulSoup

url = "https://www.rystadenergy.com/newsevents/news/press-releases/"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")


   <div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="Oil Markets" data-month="11" data-year="2020">
     <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/prices-at-stake-if-opec-increases-output-in-january-a-200-million-barrel-glut-will-build-through-may/">
      <small class="mb-3 d-flex flex-wrap justify-content-between">
       <time datetime="2020-11-30">
        November 30, 2020
       </time>
       <span>
        Oil Markets
       </span>
      </small>
      <h5 class="mb-0">
       Prices at stake: If OPEC+ increases output in January, a 200 million-barrel glut will build through May
      </h5>
     </a>
    </div>

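Fortunately, I managed to extract the article titles and publication dates. I first created a bs4.element.ResultSet and then wrote a loop over it to collect each date, as shown below. It works fine (the same goes for the article titles).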

divs = soup.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

dates = []
for container in divs:
    date = container.find('time')
    dates.append(date['datetime'])

However, when I try to extract each article's category, which sits between <span></span> (Oil Markets in the example above), I get the error: 'NoneType' object has no attribute 'text'. The code I used for this is:

for container in divs:
    topic = container.find('span').text
    topics.append(topic)  

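The strange thing is that when I print(topics), I get a list with more elements than there should be (almost 800, sometimes even more), and the elements are mixed: the list contains both strings and bs4 Tag elements. Here is a snapshot of the list I get: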

</span>, <span> E&amp;P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Gas Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> E&amp;P </span>, <span> Oil Markets </span>, <span> E&amp;P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',

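My goal is to extract the categories as a list of strings (there should be 207 of them in total) so that I can later put them into a dataframe together with the dates and titles.

I have already tried the solutions here, here, and here without success. I would appreciate any help with this.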

Tags: python, web-scraping, beautifulsoup

Answer


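Your code is fine. You only need to add a try/except so that it does not crash on the few articles that have no category.

The following snippet shows which articles those are: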

from bs4 import BeautifulSoup
import requests

html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')

divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')

for container in divs:
    topic = container.find('span')
    if topic is None:  # find() returns None when the item has no <span>
        print(container)

Output:

<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>

As you can see, these items contain no span element, so find('span') returns None for them.

So in your case:
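To make the cause of the error concrete, here is a minimal sketch. It assumes a trimmed-down copy of one of the items printed above (most classes and the link are left out) instead of the live page, so it runs without a network connection:

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of one of the items printed above: it has a <time> and
# an <h5>, but no <span>. The shortened markup is an assumption made for
# illustration, not the exact page source.
html_without_span = """
<div class="news-events-list__item" data-category="">
  <small><time datetime="2020-01-28">January 28, 2020</time></small>
  <h5>Rystad Energy announces winners for Gullkronen 2020</h5>
</div>
"""

container = BeautifulSoup(html_without_span, "html.parser").find("div")
span = container.find("span")  # nothing matches, so find() returns None

error_message = None
try:
    span.text  # attribute access on None raises AttributeError
except AttributeError as exc:
    error_message = str(exc)

print(span)
print(error_message)
```

The second print shows the same kind of message as the one reported in the question, which suggests the crash comes from these span-less items and not from the items that do have a category.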

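topics = []  # start from an empty list on every run
for container in divs:
    try:
        topic = container.find('span').text.strip()
    except AttributeError:  # find() returned None: this item has no <span>
        topic = ''
    topics.append(topic)

Note that this is just one way to do it :)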
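As one possible alternative, the missing span can be handled with an explicit None check, and the date, title, and category can be collected in the same pass. The sketch below is not the only way to do this. It assumes two sample items modelled on the markup quoted on this page (with a shortened first title), that the title always sits in the h5 tag, and that the data-category attribute mirrors the span text. That last point holds for the two items shown here, but I have not checked it for items with several categories.

```python
from bs4 import BeautifulSoup

# Two sample items modelled on the markup quoted on this page: the first has
# a <span>, the second does not. The first title is shortened for brevity.
sample_html = """
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="Oil Markets" data-month="11" data-year="2020">
 <a href="/newsevents/news/press-releases/prices-at-stake-if-opec-increases-output-in-january-a-200-million-barrel-glut-will-build-through-may/">
  <small><time datetime="2020-11-30">November 30, 2020</time><span> Oil Markets </span></small>
  <h5 class="mb-0"> Prices at stake </h5>
 </a>
</div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020">
 <a href="/newsevents/news/press-releases/winners-gullkronen-2020/">
  <small><time datetime="2020-01-28">January 28, 2020</time></small>
  <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5>
 </a>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
# Matching on a single class is enough: class_ matches any one of the
# classes listed on the element.
divs = soup.find_all("div", class_="news-events-list__item")

rows = []  # rebuilt from scratch on every run
for container in divs:
    span = container.find("span")
    rows.append({
        "date": container.find("time")["datetime"],
        "title": container.find("h5").get_text(strip=True),
        # Fall back to an empty string when the item has no <span>.
        "topic": span.get_text(strip=True) if span is not None else "",
    })

# Variant: read the category from the div's own attribute instead.
attr_topics = [container.get("data-category", "") for container in divs]

for row in rows:
    print(row)
print(attr_topics)
```

Because every item contributes exactly one dictionary, the dates, titles, and categories stay aligned, and a list like rows can be passed directly to a dataframe constructor such as pandas.DataFrame.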

