How can I use BeautifulSoup in Python 3.7 to collect content from USA Today newspaper articles?

Problem description

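I am collecting the date, title, and content of USA Today articles. I can get the date, the title, and even the content, but along with the content I also get some unwanted text. What should I change in my code so that I get only the content (the article body)?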

import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.usatoday.com/search/?q=cybersecurity&page={}'
pages = 72

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")

    for item in soup.find_all("a", {"class": "gnt_se_a"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.usatoday.com"+_href)
            except Exception as e:
                continue

        sauce = BeautifulSoup(resp.text,"lxml")
        dateTag = sauce.find("span",{"class": "asset-metabar-time asset-metabar-item nobyline"})
        titleTag = sauce.find("h1", {"class": "asset-headline speakable-headline"})
        contentTag = sauce.find("div", {"class": "asset-double-wide double-wide p402_premium"})

        date = None
        title = None
        content = None

        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()

        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {title}\n {content}\n')

        time.sleep(3)

I expect the date, title, and content of each article.

Tags: web-scraping, beautifulsoup, python-3.7

Solution


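I tried locating the content as follows, selecting the article's paragraph tags rather than the whole container div: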

contentTag = sauce.find_all('p',{"class": "p-text"})

The check for the content then becomes:

if isinstance(contentTag,list):
    content = []
    for c in contentTag:
        content.append(c.get_text().strip())
    content = ' '.join(content)

This works.
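To illustrate why selecting the paragraphs helps, here is a minimal, self-contained sketch. The markup in SAMPLE_HTML is a hypothetical, simplified stand-in for an article page (the real USA Today markup may differ and can change over time), and extract_article_text is a helper name introduced only for this example. It uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified article markup (an assumption for illustration);
# the container holds both the article paragraphs and unrelated widgets.
SAMPLE_HTML = """
<div class="asset-double-wide double-wide p402_premium">
  <div class="share-bar">Share this story</div>
  <p class="p-text">First paragraph of the article.</p>
  <aside>Related links</aside>
  <p class="p-text">  Second paragraph of the article.  </p>
</div>
"""


def extract_article_text(html):
    """Join the text of every <p class="p-text"> element with single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list-like ResultSet; it is empty if nothing matches.
    paragraphs = soup.find_all("p", {"class": "p-text"})
    return " ".join(p.get_text().strip() for p in paragraphs)


if __name__ == "__main__":
    # Text of the whole document, including the share bar and related links.
    print(BeautifulSoup(SAMPLE_HTML, "html.parser").get_text())
    # Only the article paragraphs.
    print(extract_article_text(SAMPLE_HTML))
```

Calling get_text() on the container returns the text of every descendant, which is presumably where the unwanted text in the question comes from. Restricting the search to the paragraph tags leaves that text out.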

