How to scrape certain HTML classes in Python

Problem description

I am trying to scrape a site and collect the text of every element on the page that has a specific class.

from bs4 import BeautifulSoup
import requests
sources = ['https://cnn.com']

for source in sources:
    page = requests.get(source)

    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find_all("div", class_='cd_content')
    for result in results:
        title = result.find('span', class_="cd__headline-text vid-left-enabled")
        print(title)

From what I have found online this should work, but for some reason it finds nothing and the result is empty. Any help is greatly appreciated.

Tags: python, html, python-3.x, beautifulsoup

Solution


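If you inspect the network calls in the browser's developer tools, you will see that the page content is loaded dynamically by sending a GET request to: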

https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl
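
One way to confirm that a page is filled in dynamically is to count how often the target class appears in the raw HTML that requests receives. The following is a minimal offline sketch of that check. STATIC_HTML and the helper count_class are hypothetical stand-ins invented for illustration and are not the real CNN response.

```python
from bs4 import BeautifulSoup

# Hypothetical static HTML (an assumption, not the real page): the headline
# container is empty because a script is expected to fill it in later.
STATIC_HTML = """
<html><body>
  <div id="homepage1-zone-1"></div>
  <script src="/loader.js"></script>
</body></html>
"""


def count_class(html: str, class_name: str) -> int:
    """Return how many tags in the given HTML carry the given CSS class."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all(class_=class_name))


# Zero hits in the raw HTML suggests the elements are added after page load.
static_hits = count_class(STATIC_HTML, "cd__headline-text")
print(static_hits)
```

If the count is zero for a class that is visible in the browser, the markup most likely arrives through a separate request such as the one above.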

This endpoint returns JSON, and the HTML is available under its "html" key:

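import requests
from bs4 import BeautifulSoup

# Endpoint seen in the browser's network tab; it returns JSON, not an HTML page.
URL = "https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl"

response = requests.get(URL, timeout=10)
response.raise_for_status()  # stop early on an HTTP error

# The markup is stored under the "html" key of the JSON payload.
html = response.json()["html"]
soup = BeautifulSoup(html, "html.parser")

# A class string containing a space matches that exact class attribute value.
for tag in soup.find_all(class_="cd__headline-text vid-left-enabled"):
    print(tag.text)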

Output (truncated):

This is the first Covid-19 vaccine in the US authorized for use in younger teens and adolescents
When the US could see Covid cases and deaths plummet 
'Truly, madly, deeply false': Keilar fact-checks Ron Johnson's vaccine claim
These are the states with the highest and lowest vaccination rates
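
The class lookup above passes two class names as a single string, which BeautifulSoup compares against the exact value of the class attribute, so the order of the names matters. The sketch below illustrates the difference between that exact match, a CSS selector, and a single-class search. It runs offline on a hypothetical payload: SAMPLE_PAYLOAD, its headlines, and the helper extract_headlines are assumptions made for illustration, and only the "html" key mirrors the answer.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical JSON payload (an assumption); it only mimics the "html" key.
SAMPLE_PAYLOAD = json.dumps({
    "html": (
        '<div class="cd__content">'
        '<span class="cd__headline-text vid-left-enabled">First headline</span>'
        '<span class="vid-left-enabled cd__headline-text">Second headline</span>'
        '<span class="cd__headline-text">Third headline</span>'
        '</div>'
    )
})


def extract_headlines(payload: str) -> dict:
    """Parse the "html" key of a JSON payload and match spans three ways."""
    soup = BeautifulSoup(json.loads(payload)["html"], "html.parser")

    def texts(tags):
        return [tag.get_text(strip=True) for tag in tags]

    return {
        # Exact string match on the class attribute value: order-sensitive.
        "exact": texts(soup.find_all(class_="cd__headline-text vid-left-enabled")),
        # CSS selector: both classes required, in any order.
        "both": texts(soup.select("span.cd__headline-text.vid-left-enabled")),
        # Single class: matches any span that carries it.
        "single": texts(soup.find_all("span", class_="cd__headline-text")),
    }


headlines = extract_headlines(SAMPLE_PAYLOAD)
for mode, found in headlines.items():
    print(mode, found)
```

If the site may emit the classes in a different order, the selector or the single-class search is probably the safer choice.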
