python - Extracting text from a span tag without a class in BeautifulSoup
Problem description
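I am trying to extract data from a website for a small data analysis project. This is the HTML source I am working with (all the divs I want to extract data from have exactly the same structure).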
import requests
from bs4 import BeautifulSoup
url = "https://www.rystadenergy.com/newsevents/news/press-releases/"
results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="Oil Markets" data-month="11" data-year="2020">
  <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/prices-at-stake-if-opec-increases-output-in-january-a-200-million-barrel-glut-will-build-through-may/">
    <small class="mb-3 d-flex flex-wrap justify-content-between">
      <time datetime="2020-11-30">
        November 30, 2020
      </time>
      <span>
        Oil Markets
      </span>
    </small>
    <h5 class="mb-0">
      Prices at stake: If OPEC+ increases output in January, a 200 million-barrel glut will build through May
    </h5>
  </a>
</div>
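Fortunately, I managed to extract the title and publication date of each article. I first created a bs4.element.ResultSet and then wrote a loop that goes through each date as follows, and it works fine (the same goes for the article titles).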
divs = soup.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')
dates = []
for container in divs:
    date = container.find('time')
    dates.append(date['datetime'])
However, when I try to extract the category of each article, which sits between <span></span> (Oil Markets in my case), I get the error: 'NoneType' object has no attribute 'text'. The code I used for this was:
for container in divs:
    topic = container.find('span').text
    topics.append(topic)
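The odd thing is that when I print(topics), I get a list with more elements than there should be (almost 800 elements, sometimes even more), and the elements are mixed: the list contains both strings and bs4 element tags. Here is a snapshot of the list I got: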
</span>, <span> E&P, Oil Markets, Supply Chain </span>, <span> Oil Markets, Gas Markets </span>, <span> Supply Chain </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Shale </span>, <span> Corporate </span>, <span> E&P </span>, <span> Oil Markets </span>, <span> Supply Chain, Other, Renewables </span>, <span> Gas Markets </span>, <span> Oil Markets </span>, <span> Gas Markets </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Gas Markets </span>, <span> E&P </span>, <span> Supply Chain </span>, <span> Shale </span>, None, <span> Corporate </span>, <span> Shale </span>, None, <span> Renewables </span>, <span> Renewables </span>, <span> Renewables </span>, <span> E&P </span>, <span> E&P </span>, <span> E&P </span>, <span> E&P </span>, <span> Oil Markets </span>, <span> E&P </span>, <span> Supply Chain </span>, ' Oil Markets ', ' Oil Markets ', ' Supply Chain, Renewables ', ' Oil Markets ', ' Renewables ', ' E&P ', ' Renewables ', ' Supply Chain ', ' Shale ', ' E&P ', ' Shale ', ' Gas Markets ', ' Gas Markets ', ' Supply Chain ', ' Oil Markets ', ' Shale ', ' Oil Markets ', ' Corporate, Oil Markets, Other ', ' Shale ', ' Renewables ', ' Shale ', ' Supply Chain ',
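My goal is to extract the categories as a list of strings (there should be 207 of them in total) so that I can later put them into a data frame together with the dates and titles.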
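Solution
Your code is fine; you just need to add a try..except to avoid crashing on the few articles that have no category.
The snippet below illustrates this: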
from bs4 import BeautifulSoup
import requests
html = BeautifulSoup(requests.get('https://www.rystadenergy.com/newsevents/news/press-releases/').text, 'html.parser')
divs = html.find_all('div', class_='col-12 col-md-6 col-lg-4 mt-3 news-events-list__item')
for container in divs:
    topic = container.find('span')
    # find() returns None when the container has no <span>
    if topic is None:
        print(container)
Output:
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/winners-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-28">January 28, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5> </a> </div>
<div class="col-12 col-md-6 col-lg-4 mt-3 news-events-list__item" data-category="" data-month="1" data-year="2020"> <a class="d-block bg-light p-3 text-body text-decoration-none h-100" href="/newsevents/news/press-releases/nominees-gullkronen-2020/"> <small class="mb-3 d-flex flex-wrap justify-content-between"> <time datetime="2020-01-23">January 23, 2020</time> </small> <h5 class="mb-0"> Rystad Energy announces nominees for Gullkronen 2020 </h5> </a> </div>
As you can see, these containers have no span element.
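To make the link to the error in the question explicit: `find` returns `None` when no matching tag exists, and `None` has no `.text` attribute. The following is a minimal sketch that uses a trimmed-down copy of one of the containers printed above instead of the live page; the shortened class list is an assumption made for brevity.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of a container without a <span> (attributes shortened for the example).
html = """
<div class="news-events-list__item" data-category="">
  <small><time datetime="2020-01-28">January 28, 2020</time></small>
  <h5 class="mb-0">Rystad Energy announces winners for Gullkronen 2020</h5>
</div>
"""
container = BeautifulSoup(html, "html.parser").find("div")

# No <span> inside this container, so find() gives back None.
span = container.find("span")
print(span)  # None

# Accessing .text on None raises the AttributeError seen in the question.
error_message = ""
try:
    span.text
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```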
So in your case:
topics = []
for container in divs:
    try:
        topic = container.find('span').text.strip()
    except AttributeError:
        # no <span> in this container, so there is no category
        topic = ''
    finally:
        topics.append(topic)
Note that this is just one way to do it :)
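Another way, sketched here under a few assumptions, is to test the result of `find` for `None` explicitly and collect the date, title and category in a single pass. The two containers below are shortened stand-ins for the real page (the first title is abbreviated and the class lists are reduced), and the `rows` name is only illustrative.

```python
from bs4 import BeautifulSoup

# Two shortened sample containers: one with a <span>, one without.
html = """
<div class="news-events-list__item" data-category="Oil Markets">
  <small>
    <time datetime="2020-11-30">November 30, 2020</time>
    <span> Oil Markets </span>
  </small>
  <h5 class="mb-0"> Prices at stake </h5>
</div>
<div class="news-events-list__item" data-category="">
  <small><time datetime="2020-01-28">January 28, 2020</time></small>
  <h5 class="mb-0"> Rystad Energy announces winners for Gullkronen 2020 </h5>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for container in soup.find_all("div", class_="news-events-list__item"):
    span = container.find("span")
    rows.append({
        "date": container.find("time")["datetime"],
        "title": container.find("h5").get_text(strip=True),
        # Fall back to an empty string when the article has no category.
        "topic": span.get_text(strip=True) if span is not None else "",
    })

topics = [row["topic"] for row in rows]
print(topics)  # ['Oil Markets', '']
```

Building the list from scratch on every run also avoids leftovers from earlier attempts piling up in `topics`. If pandas is available, a list of dicts like `rows` can be passed to `pandas.DataFrame(rows)` to get the data frame mentioned in the question.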