Scraping headlines from news website homepages with BeautifulSoup in Python

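Problem description

I am trying to use BeautifulSoup to extract headlines from the homepages of several news websites. I am learning Python but know little about HTML, CSS, or JavaScript, so I worked it out by trial and error with Inspect in Chrome. This is the code I wrote for the New York Times homepage: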

import requests
from bs4 import BeautifulSoup


url = "https://www.nytimes.com/"
r = requests.get(url)
r_html = r.text
soup = BeautifulSoup(r_html, features="html.parser")
headlines = soup.find_all(class_="css-1vynn0q esl82me3")

for item in headlines:
    if len(item.contents) == 1:
        print(item.text)
    elif len(item.contents) == 2:
        print(item.contents[1].text)

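My questions are:

  1. Since I plan to do this for several news websites, is there a better approach than this one that you could suggest?

  2. I noticed that the CSS class names have changed since I wrote this code, so I had to update it. Is there a solution that does not require me to change the code every time the class names change?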

Tags: python, web-scraping, beautifulsoup, python-requests

Solution


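This is possible, because you can find the <script> tag in the HTML and parse its contents as JSON. It may not work for every news website, since each site will most likely use different tags or code to mark up its headlines, but it gives you working code that keeps extracting the headlines even after the CSS class names are updated later.

Parse the HTML as usual: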

import requests 
from bs4 import BeautifulSoup
import json

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}

url = "https://www.nytimes.com/"
r = requests.get(url, headers=headers)
r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")

Then find all the <script> tags. The one we want starts with the text window.__preloadedData = , so we just search for it among the 14 elements found with a <script> tag:

scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
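
The lookup can be tried offline on a small inline page. This is only a sketch: sample_html is a made-up stand-in for the real homepage, and find_preloaded_script is a hypothetical helper name, not part of BeautifulSoup. It checks that the script text starts with the marker, and it returns None when nothing matches so the caller can tell that the page layout has changed.

```python
from bs4 import BeautifulSoup

# Made-up miniature page; the real homepage contains far more markup.
sample_html = """
<html><head>
<script>var unrelated = 1;</script>
<script>window.__preloadedData = {"initialState": {}};</script>
</head><body></body></html>
"""


def find_preloaded_script(soup, marker="window.__preloadedData"):
    """Return the text of the first <script> that starts with marker, else None."""
    for script in soup.find_all("script"):
        # .string is None for empty tags such as <script src="...">.
        text = (script.string or "").strip()
        if text.startswith(marker):
            return text
    return None


soup = BeautifulSoup(sample_html, "html.parser")
script_text = find_preloaded_script(soup)
print(script_text)
```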

Once it is found, we store it as jsonStr, then trim the beginning and end of the string so that it becomes pure JSON, which can be loaded with json.loads() and stored as our jsonObj:

    jsonStr = jsonStr.split('=', 1)[1].strip()
    jsonStr = jsonStr.rsplit(';', 1)[0]
    jsonObj = json.loads(jsonStr)
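
To see the trimming step in isolation, here is a minimal sketch that runs on a hard-coded string instead of a live page. The sample text and the helper name extract_preloaded_json are assumptions made for illustration. One small difference from the lines above: it removes the semicolon only when it is the last character, so a ';' inside a headline is left alone.

```python
import json

# Hypothetical stand-in for the text of the matching <script> tag.
sample_script_text = (
    'window.__preloadedData = '
    '{"initialState": {"a1": {"headline": "Example headline"}}};'
)


def extract_preloaded_json(script_text):
    """Strip the JavaScript assignment and trailing semicolon, then parse the JSON."""
    # Everything after the first '=' is the object literal.
    payload = script_text.split('=', 1)[1].strip()
    # Remove the statement terminator only if it is the final character.
    if payload.endswith(';'):
        payload = payload[:-1]
    return json.loads(payload)


json_obj = extract_preloaded_json(sample_script_text)
print(json_obj['initialState']['a1']['headline'])
```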

Once we have jsonObj, we iterate through the key:value pairs in the structure to find the values associated with the headline key in the JSON object:

for ele, v in jsonObj['initialState'].items():
    try:
        if v['headline']:
            print(v['headline'])
    except (KeyError, TypeError):
        # Skip entries that have no 'headline' key or are not dicts.
        continue
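
The same traversal can also be written without try/except by checking types explicitly. The sketch below runs on a tiny hand-made dictionary; its keys and values are invented and only mimic the shape of jsonObj['initialState'], and collect_headlines is a hypothetical helper. The optional typename argument corresponds to the __typename check used in the full code.

```python
# Invented miniature of jsonObj['initialState']; real keys and values differ.
initial_state = {
    'ROOT_QUERY': {'__typename': 'Query'},
    'Article:1': {'__typename': 'PromotionalProperties',
                  'headline': 'First example headline'},
    'Article:2': {'__typename': 'PromotionalProperties', 'headline': ''},
    'meta': 'not a dict',
    'Article:3': {'__typename': 'Other',
                  'headline': 'Second example headline'},
}


def collect_headlines(state, typename=None):
    """Return non-empty 'headline' values, optionally filtered by '__typename'."""
    headlines = []
    for value in state.values():
        if not isinstance(value, dict):
            continue  # some entries are plain strings or numbers
        headline = value.get('headline')
        if not headline:
            continue  # missing or empty headline
        if typename is not None and value.get('__typename') != typename:
            continue
        headlines.append(headline)
    return headlines


print(collect_headlines(initial_state))
print(collect_headlines(initial_state, typename='PromotionalProperties'))
```

Returning a list instead of printing inside the loop makes it easier to store or compare the headlines afterwards.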

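Full code:

I also added a datetime element, since you may want to store it to see what the headlines were at a particular date and time, as the page is updated later.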

import requests 
from bs4 import BeautifulSoup
import json
import datetime

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36'}

url = "https://www.nytimes.com/"
r = requests.get(url, headers=headers)
now = datetime.datetime.now()
now = now.strftime('%A, %B %d, %Y  %I:%M %p')

r_html = r.text
soup = BeautifulSoup(r_html, "html.parser")

scripts = soup.find_all('script')
for script in scripts:
    if 'preloadedData' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split('=', 1)[1].strip()
        jsonStr = jsonStr.rsplit(';', 1)[0]
        jsonObj = json.loads(jsonStr)
 

print('%s\nHeadlines\n%s\n' % (url, now))
count = 1
for ele, v in jsonObj['initialState'].items():
    try:
        if v['headline'] and v['__typename'] == 'PromotionalProperties':
            print('Headline %s: %s' %(count, v['headline']))
            count += 1
    except (KeyError, TypeError):
        continue

Output:

https://www.nytimes.com/
Headlines
Thursday, March 07, 2019  11:50 AM

Headline 1: The Trade Deficit Set a New Record. For Trump, That’s a Failure.
Headline 2: Rules Relaxed for Banks, Giving Wall Street a Big Win
Headline 3: Biden’s Candidacy Plan is Almost Complete. Democrats Are Impatient.
Headline 4: Why Did Four Top Democrats Just Say No to 2020?
Headline 5: Why Birthrates Among Hispanic Americans Have Plummeted
Headline 6: The Top 25 Songs That Matter Right Now
Headline 7: Cohen Says Papers Prove His Lies Were Aided by Trump Lawyers
Headline 8: Paul Manafort to Be Sentenced Thursday in 1 of 2 Cases Against Him
Headline 9: Trump’s Lawyer Says Several Have Sought Presidential Pardons
Headline 10: Senator Says She Was Raped in the Military, Describing a Broken System
Headline 11: Your Thursday Briefing
Headline 12: Listen to ‘The Daily’
Headline 13: Listen: ‘Modern Love’ Podcast
Headline 14: In the ‘DealBook’ Newsletter
Headline 15: What if the Mueller Report Demands Bold Action?
Headline 16: Ilhan Omar Knows Exactly What She Is Doing
Headline 17: How to Think About Taxing and Spending Like a Swede
Headline 18: Even Google Can No Longer Hide Its Gender Pay Gap
Headline 19: Questions For and About Jared Kushner
Headline 20: The Big Race: It’s Time for a Rhyme
Headline 21: Listen to ‘The Argument’: How Does the Catholic Church Redeem Itself?
Headline 22: We Will Survive. Probably.
Headline 23: A Peace Plan for India and Pakistan Already Exists
Headline 24: Ilhan Omar, Aipac and Me
Headline 25: The India-Pakistan Conflict Was a Parade of Lies
Headline 26: Seven Buds for Seven Brothers
Headline 27: This Tech Makes D.I.Y. Key Duplication Easy. Maybe Too Easy.
Headline 28: 36 Hours in St. Augustine
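
If you do want to keep the headlines together with the time they were seen, one option is to append them to a CSV file. The following is a rough sketch rather than part of the code above: write_snapshot, the column order, and the placeholder headlines are all assumptions, and it writes to an in-memory buffer so that it can be run without touching the disk.

```python
import csv
import datetime
import io


def write_snapshot(headlines, url, out_file, now=None):
    """Write one CSV row per headline: timestamp, source URL, position, headline."""
    now = now or datetime.datetime.now()
    writer = csv.writer(out_file)
    for position, headline in enumerate(headlines, start=1):
        writer.writerow([now.isoformat(timespec='minutes'), url, position, headline])


# Placeholder data and a fixed time, so the example is reproducible.
buffer = io.StringIO()
write_snapshot(
    ['Example headline A', 'Example headline B'],
    'https://www.nytimes.com/',
    buffer,
    now=datetime.datetime(2019, 3, 7, 11, 50),
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open('headlines.csv', 'a', newline='', encoding='utf-8'); the file name here is only an example.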
