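python - Removing unwanted elements from the results of a Python web scraping loop
Problem description
I am currently trying to extract the text and the tags (topics) from a web page with the following code: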
import requests
from bs4 import BeautifulSoup

Texts = []
Topics = []
url = 'https://www.unep.org/news-and-stories/story/yes-climate-change-driving-wildfires'
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
if response.ok:
    soup = BeautifulSoup(response.text, 'lxml')
    txt = soup.findAll('div', {'class': 'para_content_text'})
    for div in txt:
        # findAll returns a ResultSet (a list of <p> tags), not a single tag
        p = div.findAll('p')
        Texts.append(p)
    print(Texts)
    top = soup.find('div', {'class': 'article_tags_topics'})
    # again a ResultSet, this time of <a> tags
    a = top.findAll('a')
    Topics.append(a)
    print(Topics)
The code runs without errors, but here is an excerpt of the output it produces:
</p>, <p><strong>UNEP:</strong> And this is bad news?</p>, <p><strong>NH:</strong> This is bad news. This is bad for our health, for our wallet and for the fabric of society.</p>, <p><strong>UNEP:</strong> The world is heading towards a global average temperature that’s 3<strong>°</strong>C to 4<strong>°</strong>C higher than it was before the industrial revolution. For many people, that might not seem like a lot. What do you say to them?</p>, <p><strong>NH:</strong> Just think about your own body. When your temperature goes up from 36.7°C (98°F) to 37.7°C (100°F), you’ll probably consider taking the day off. If it goes 1.5°C above normal, you’re staying home for sure. If you add 3°C, people who are older and have preexisting conditions – they may die. The tolerances are just as tight for the planet.</p>]]
[[<a href="/explore-topics/forests">Forests</a>, <a href="/explore-topics/climate-change">Climate change</a>]]
Since I am looking for "clean" text results, I tried adding the following line inside the loop to get only the text:
p = p.text
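But I got:
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I also noticed that the topics result contains URLs I don't want; I would only like to get Forests and Climate change as results (without the comma between them).
Any idea what I could add to my code to get clean text and topics?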
Solution
This happens because p is a ResultSet object, that is, a list of tags rather than a single tag. You can see this by running:
print(type(Texts[0]))
Output:
<class 'bs4.element.ResultSet'>
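As a small offline check of that point, here is a minimal sketch. It assumes a hand-written HTML fragment modelled on the output excerpt in the question (not the live page), and uses the built-in 'html.parser' so that it runs without lxml. It shows that a ResultSet behaves like a list, while each element inside it is a Tag that does have .text:

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Assumed sample markup, shortened from the excerpt in the question.
html = (
    '<div class="para_content_text">'
    '<p><strong>UNEP:</strong> And this is bad news?</p>'
    '<p><strong>NH:</strong> This is bad news.</p>'
    '</div>'
)

soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find('div', {'class': 'para_content_text'}).find_all('p')

# The container is a ResultSet, which is a list subclass and has no .text.
print(isinstance(paragraphs, ResultSet), isinstance(paragraphs, list))

# Each element is a Tag, and .text works on a Tag.
first = paragraphs[0]
print(isinstance(first, Tag))
print(first.text)
```

So the fix is to call .text on the individual items, not on the collection that holds them.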
To get the actual text, you can process each item in each ResultSet directly:
for result in Texts:
    for item in result:
        print(item.text)
Output:
As wildfires sweep across the western United States, taking lives, destroying homes and blanketing the country in smoke, Niklas Hagelberg has a sobering message: this could be America’s new normal.
......
Or even use a list comprehension:
full_text = '\n'.join([item.text for result in Texts for item in result])
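For the topics part of the question, the same idea should apply: read the text of each <a> tag instead of keeping the tags themselves, then join the names with a space so no comma appears between them. The following is only a sketch. The HTML fragment is an assumption reconstructed from the printed output in the question, and 'html.parser' is used so the snippet runs without lxml; on the real page you would reuse the soup object built from response.text.

```python
from bs4 import BeautifulSoup

# Assumed sample markup, reconstructed from the output shown in the question.
html = '''
<div class="article_tags_topics">
  <a href="/explore-topics/forests">Forests</a>,
  <a href="/explore-topics/climate-change">Climate change</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
top = soup.find('div', {'class': 'article_tags_topics'})

# find() returns None if the div is missing, so guard before searching inside it.
topics = [a.get_text(strip=True) for a in top.find_all('a')] if top else []

# Join with a space to get the names without the separating comma.
clean_topics = ' '.join(topics)
print(clean_topics)
```

Keeping topics as a list of plain strings also makes it straightforward to store or compare them later, and the joined string is only needed for display.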