python - Scraping headlines from news website homepages with BeautifulSoup in Python
Question
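I am trying to use BeautifulSoup to extract headlines from the homepages of several news websites. I am learning Python but don't know much about HTML, JavaScript, or CSS, so I did some trial and error with Inspect in Chrome. Here is the code I wrote for the New York Times homepage: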
import requests
from bs4 import BeautifulSoup

url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, features="html.parser")
# Class names found with Chrome's Inspect tool.
headlines = soup.find_all(class_="css-1vynn0q esl82me3")
for item in headlines:
    if len(item.contents) == 1:
        print(item.text)
    elif len(item.contents) == 2:
        print(item.contents[1].text)
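Here are my questions:
Since I plan to do this for several news websites, is there a better approach than this one that you could suggest?
I noticed that the CSS class names have changed since I wrote this code, so I had to update it. Is there a solution that doesn't require me to change the code every time the class names are updated?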
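Answer
This is possible: you can find the <script> tag in the HTML and parse its contents as JSON. It may not work for every news site, since each one will most likely use different tags/code to identify its headlines, but you can have general working code that still extracts these headlines after they are updated.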
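Since each site will likely need its own extraction logic, one way to organize a multi-site scraper is to keep one small extractor function per site and dispatch on the hostname. The sketch below is only an illustration under that assumption: the names `EXTRACTORS`, `register`, `extract_headlines`, and the placeholder extractor for `www.example.com` are hypothetical and not part of the original answer.

```python
from urllib.parse import urlparse

# Hypothetical registry: hostname -> function(html) -> list of headline strings.
EXTRACTORS = {}

def register(hostname):
    """Decorator that records an extractor function for one hostname."""
    def decorator(func):
        EXTRACTORS[hostname] = func
        return func
    return decorator

@register("www.example.com")
def extract_example(html):
    # Placeholder logic for illustration only; a real extractor would
    # parse the page the way that particular site requires.
    return [line.strip() for line in html.splitlines() if line.strip()]

def extract_headlines(url, html):
    """Look up the extractor for the URL's hostname and apply it."""
    hostname = urlparse(url).hostname
    extractor = EXTRACTORS.get(hostname)
    if extractor is None:
        raise ValueError("no extractor registered for %s" % hostname)
    return extractor(html)

demo = extract_headlines("https://www.example.com/", "First\n\n  Second \n")
```

With this layout, supporting another site means adding one more registered function, and a site whose markup changes only requires touching its own extractor.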
Parse the HTML as usual:
import requests
from bs4 import BeautifulSoup
import json
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}
url = "https://www.nytimes.com/"
r = requests.get(url, headers=headers)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")
Then find all the <script> tags. The one we want starts with the text window.__preloadedData =, so we just need to pick it out of the 14 <script> elements that were found:
scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
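If no script contains the marker, the loop above leaves `jsonStr` undefined and the next step fails with a `NameError`. A minimal sketch of a more defensive lookup follows. It works on plain strings so it can be tried without a network request; the helper name `find_preloaded_script` and the sample strings are assumptions for illustration. In the real script you could pass it something like `[s.text for s in scripts]`.

```python
def find_preloaded_script(script_texts, marker="preloadedData"):
    """Return the first script body containing `marker`, or None."""
    return next((text for text in script_texts if marker in text), None)

# Invented sample script bodies; not real page data.
sample_scripts = [
    "console.log('unrelated');",
    'window.__preloadedData = {"initialState": {}};',
]
found = find_preloaded_script(sample_scripts)
missing = find_preloaded_script(["console.log('unrelated');"])
```

Note that this returns the first match, whereas the loop keeps the last one; the two agree when only one script contains the marker.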
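Once it is found, we store it as jsonStr, then trim the leading and trailing parts of the string so that only pure JSON remains. That can then be loaded with json.loads() and stored as our jsonObj: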
jsonStr = jsonStr.split('=', 1)[1].strip()
jsonStr = jsonStr.rsplit(';', 1)[0]
jsonObj = json.loads(jsonStr)
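To see what the two trimming steps do, here is the same logic applied to a small made-up string shaped like the script body. The sample payload is invented for illustration; the real page content is much larger and may change.

```python
import json

# Invented sample resembling the script body; not real page data.
raw = 'window.__preloadedData = {"initialState": {"a": {"headline": "x = y; z"}}};'

# Split on the first '=' only, so '=' characters inside the JSON survive.
trimmed = raw.split('=', 1)[1].strip()
# Split on the last ';' only, dropping the trailing statement terminator.
trimmed = trimmed.rsplit(';', 1)[0]

data = json.loads(trimmed)
```

This only works when the remaining text is valid JSON; if the payload contained JavaScript-only values such as `undefined`, `json.loads` would raise an error.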
Once we have jsonObj, we iterate over the key:value pairs in the structure to find the values associated with the headline key in the JSON object:
for ele, v in jsonObj['initialState'].items():
    try:
        if v['headline']:
            print(v['headline'])
    except (KeyError, TypeError):
        # Skip entries that are not dicts or have no 'headline' key.
        continue
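The `try`/`except` above skips entries that are not dictionaries or that have no `headline` key. A sketch with explicit checks, run on invented sample data, makes that filtering visible. The helper name `collect_headlines` and the sample values are assumptions, and it should behave the same way for data that came from `json.loads`.

```python
def collect_headlines(initial_state):
    """Return non-empty 'headline' values from the dict-valued entries."""
    headlines = []
    for key, value in initial_state.items():
        # Skip entries that are not dicts (strings, lists, None, ...).
        if isinstance(value, dict) and value.get("headline"):
            headlines.append(value["headline"])
    return headlines

# Invented sample data; the real structure is much larger.
sample_state = {
    "Article:1": {"headline": "First story"},
    "Article:2": {"summary": "no headline here"},
    "Article:3": {"headline": ""},
    "meta": "not a dict",
}
result = collect_headlines(sample_state)
```

Returning a list instead of printing also makes it easier to store or compare the headlines later.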
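Full code:
I also added a datetime element, since you may want to store it to see what the headlines were at a particular date/time, given that the page is updated later.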
import requests
from bs4 import BeautifulSoup
import json
import datetime

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}
url = "https://www.nytimes.com/"
r = requests.get(url, headers=headers)

# Record when the page was fetched, since the headlines change over time.
now = datetime.datetime.now()
now = now.strftime('%A, %B %d, %Y %I:%M %p')

r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")

# Find the <script> that assigns window.__preloadedData and load it as JSON.
scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('=', 1)[1].strip()
        jsonStr = jsonStr.rsplit(';', 1)[0]
        jsonObj = json.loads(jsonStr)

print('%s\nHeadlines\n%s\n' % (url, now))
count = 1
for ele, v in jsonObj['initialState'].items():
    try:
        if v['headline'] and v['__typename'] == 'PromotionalProperties':
            print('Headline %s: %s' % (count, v['headline']))
            count += 1
    except (KeyError, TypeError):
        # Skip entries that are not dicts or lack these keys.
        continue
Output:
https://www.nytimes.com/
Headlines
Thursday, March 07, 2019 11:50 AM
Headline 1: The Trade Deficit Set a New Record. For Trump, That’s a Failure.
Headline 2: Rules Relaxed for Banks, Giving Wall Street a Big Win
Headline 3: Biden’s Candidacy Plan is Almost Complete. Democrats Are Impatient.
Headline 4: Why Did Four Top Democrats Just Say No to 2020?
Headline 5: Why Birthrates Among Hispanic Americans Have Plummeted
Headline 6: The Top 25 Songs That Matter Right Now
Headline 7: Cohen Says Papers Prove His Lies Were Aided by Trump Lawyers
Headline 8: Paul Manafort to Be Sentenced Thursday in 1 of 2 Cases Against Him
Headline 9: Trump’s Lawyer Says Several Have Sought Presidential Pardons
Headline 10: Senator Says She Was Raped in the Military, Describing a Broken System
Headline 11: Your Thursday Briefing
Headline 12: Listen to ‘The Daily’
Headline 13: Listen: ‘Modern Love’ Podcast
Headline 14: In the ‘DealBook’ Newsletter
Headline 15: What if the Mueller Report Demands Bold Action?
Headline 16: Ilhan Omar Knows Exactly What She Is Doing
Headline 17: How to Think About Taxing and Spending Like a Swede
Headline 18: Even Google Can No Longer Hide Its Gender Pay Gap
Headline 19: Questions For and About Jared Kushner
Headline 20: The Big Race: It’s Time for a Rhyme
Headline 21: Listen to ‘The Argument’: How Does the Catholic Church Redeem Itself?
Headline 22: We Will Survive. Probably.
Headline 23: A Peace Plan for India and Pakistan Already Exists
Headline 24: Ilhan Omar, Aipac and Me
Headline 25: The India-Pakistan Conflict Was a Parade of Lies
Headline 26: Seven Buds for Seven Brothers
Headline 27: This Tech Makes D.I.Y. Key Duplication Easy. Maybe Too Easy.
Headline 28: 36 Hours in St. Augustine
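Because the homepage changes over time, it can be useful to store each headline together with the fetch time rather than only printing it. A minimal sketch using the standard `csv` module follows. The function name `write_headlines` and the column layout are assumptions, the two sample headlines are taken from the output above, and `io.StringIO` stands in for a real file.

```python
import csv
import datetime
import io

def write_headlines(out, url, fetched_at, headlines):
    """Write one CSV row per headline: timestamp, url, position, text."""
    writer = csv.writer(out)
    stamp = fetched_at.isoformat(timespec="minutes")
    for position, headline in enumerate(headlines, start=1):
        writer.writerow([stamp, url, position, headline])

buffer = io.StringIO()
write_headlines(
    buffer,
    "https://www.nytimes.com/",
    datetime.datetime(2019, 3, 7, 11, 50),
    ["Your Thursday Briefing", "36 Hours in St. Augustine"],
)
# Read the rows back to check what was written.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
```

To append to a file on disk instead, the same function could be called with a handle opened as `open(path, "a", newline="")`.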