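python - How do I fix Python code that runs fine (exit code 0) but produces no output (nothing is printed)?
Problem description
I am trying to scrape a New York Times web page. My code appears to run fine, since it finishes with exit code 0, but it prints no results.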
import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com/search?endDate=20190331&query=cybersecurity&sort=newest&startDate=20180401={}'
pages = [0]

for page in pages:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")

    for item in soup.select("#search-results li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select(".css-1vkm6nb ehdk2mb0 h1")
        date = date.text
        print(date)
        time.sleep(3)
With this code, I expect to get the publication date of each article.
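Solution
Nice try, you were close. The problem is the selectors:
#search-results asks for an id that does not exist. The element is an <ol data-testid="search-results">, so we need another way to reach the anchor tags. Because select() matches nothing and returns an empty list, the body of the inner for loop never runs, which is why the script exits with code 0 and prints nothing.
.css-1vkm6nb ehdk2mb0 h1 doesn't make much sense. It asks for an h1 element inside an ehdk2mb0 element, which is in turn inside an element with class css-1vkm6nb. What is actually on the page is an <h1 class="css-1vkm6nb ehdk2mb0"> element. Select it with h1.css-1vkm6nb.ehdk2mb0.
That said, this is not the time data you are after; it is the title. We can grab the <time> element with a simple sauce.find("time").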
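The selector differences described above can be checked without touching the network. The following is a minimal sketch on a hand-written HTML snippet: the markup is an assumption modelled on the elements mentioned in this answer, not the real NYT page, and the attribute selector on data-testid is just one possible alternative to the class-based selector used in the full example. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the elements described above (not the real page).
html = """
<ol data-testid="search-results">
  <li><a href="/2019/03/30/example.html">Example article</a></li>
</ol>
<h1 class="css-1vkm6nb ehdk2mb0">Example headline</h1>
<time datetime="2019-03-30">March 30, 2019</time>
"""
soup = BeautifulSoup(html, "html.parser")

# No element has id="search-results", so this is empty and a for loop
# over it would silently do nothing.
by_id = soup.select("#search-results li > a")

# An attribute selector does match the <ol data-testid="search-results">.
by_attr = soup.select('ol[data-testid="search-results"] li > a')

# Spaces mean "descendant": <h1> inside <ehdk2mb0> inside .css-1vkm6nb.
descendant = soup.select(".css-1vkm6nb ehdk2mb0 h1")

# Chaining the classes onto h1 matches one element carrying both classes.
compound = soup.select_one("h1.css-1vkm6nb.ehdk2mb0")

# find() returns a single element, so .text can be read from it directly.
published = soup.find("time")

print(len(by_id), len(by_attr), len(descendant))  # 0 1 0
print(compound.text, "|", published.text)  # Example headline | March 30, 2019
```

Note that select() always returns a list, so reading .text from its result, as the question's code does with date.text, would fail even if the selector matched; select_one() or find() returns a single element instead.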
Full example:
import requests
from bs4 import BeautifulSoup

base = "https://www.nytimes.com"
url = "https://www.nytimes.com/search?endDate=20190331&query=cybersecurity&sort=newest&startDate=20180401={}"
pages = [0]

for page in pages:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")

    # The hrefs are joined onto the site root before each article is requested.
    for link in soup.select(".css-138we14 a"):
        resp = requests.get(base + link.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")

        # select_one returns a single element (or None); select returns a list.
        title = sauce.select_one("h1.css-1j5ig2m.e1h9rw200")
        time = sauce.find("time")
        print(time.text, title.text.encode("utf-8"))
Output:
March 30, 2019 b'Bezos\xe2\x80\x99 Security Consultant Accuses Saudis of Hacking the Amazon C.E.O.\xe2\x80\x99s Phone'
March 29, 2019 b'In Ukraine, Russia Tests a New Facebook Tactic in Election Tampering'
March 28, 2019 b'Huawei Shrugs Off U.S. Clampdown With a $100 Billion Year'
March 28, 2019 b'N.S.A. Contractor Arrested in Biggest Breach of U.S. Secrets Pleads Guilty'
March 28, 2019 b'Grindr Is Owned by a Chinese Firm, and the U.S. Is Trying to Force It to Sell'
March 28, 2019 b'DealBook Briefing: Saudi Arabia Wanted Cash. Aramco Just Obliged.'
March 28, 2019 b'Huawei Security \xe2\x80\x98Defects\xe2\x80\x99 Are Found by British Authorities'
March 25, 2019 b'As Special Counsel, Mueller Kept Such a Low Profile He Seemed Almost Invisible'
March 21, 2019 b'Quotation of the Day: In New Age of Digital Warfare, Spies for Any Nation\xe2\x80\x99s Budget'
March 21, 2019 b'Coast Guard\xe2\x80\x99s Top Officer Pledges \xe2\x80\x98Dedicated Campaign\xe2\x80\x99 to Improve Diversity'
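Since the original symptom was a script that exits cleanly while printing nothing, it can help to make an empty selector match visible. The sketch below is one possible way to do that, not part of the original answer: article_links and published_date are hypothetical helper names, and they work on HTML strings so they can be tried offline before wiring them to requests.get(...).text.

```python
import sys
from bs4 import BeautifulSoup


def article_links(search_html, selector):
    """Return the hrefs matched by selector, warning on stderr if none match."""
    soup = BeautifulSoup(search_html, "html.parser")
    links = [a.get("href") for a in soup.select(selector) if a.get("href")]
    if not links:
        # Without this, an empty match just means the loop never runs.
        print(f"warning: selector {selector!r} matched no links", file=sys.stderr)
    return links


def published_date(article_html):
    """Return the text of the first <time> element, or None if there is none."""
    tag = BeautifulSoup(article_html, "html.parser").find("time")
    return tag.get_text(strip=True) if tag else None


# Hypothetical snippets standing in for downloaded pages.
search_page = '<ol class="results"><li><a href="/a.html">A</a></li></ol>'
print(article_links(search_page, ".results a"))         # ['/a.html']
print(article_links(search_page, "#search-results a"))  # [] plus a warning on stderr
print(published_date("<time>March 30, 2019</time>"))    # March 30, 2019
print(published_date("<p>no date here</p>"))            # None
```

Returning None for a missing <time> element also avoids the AttributeError that time.text would raise on a page without one.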