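python - How do I reschedule pages that returned a 403 response with Scrapy?
Problem description
I sometimes receive a 403 response when scraping pages with Scrapy 2.4.1. The retry downloader middleware is configured for 5 attempts, and after the fifth attempt it does give up: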
2021-02-06 01:44:17 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url...> (failed 5 times): 403 Forbidden
2021-02-06 01:44:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.url...>: HTTP status code is not handled or not allowed
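For context, a log like the one above would be consistent with retry settings along the following lines. This is only a guess at the configuration, since the actual settings file is not shown; the values are assumptions chosen to match "failed 5 times" and the retried 403 status.

```python
# settings.py (assumed; the asker's real settings are not shown)
RETRY_ENABLED = True

# RETRY_TIMES counts retries after the first attempt,
# so 4 retries means 5 attempts in total.
RETRY_TIMES = 4

# Scrapy's default retry codes plus 403, which is not retried by default.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429, 403]
```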
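However, the documentation says that failed pages are rescheduled at the end of the crawl, and that is not what happens. Once Scrapy gives up on a page, it never retries it again.
"Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages."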
https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/retry.html
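My question is: how can I configure the middleware so that it does not retry these pages immediately after a failure, but instead moves on to another URL and reschedules the failed pages once the rest have been crawled?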
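The behaviour asked for here can be illustrated independently of Scrapy. The following is a minimal sketch in plain Python, not Scrapy's API: `crawl_deferring_failures` and the `fetch` callable are hypothetical names, and `fetch` is assumed to return an HTTP status code. A failed URL goes to the back of the queue, so every other pending URL is tried before it is attempted again.

```python
from collections import deque


def crawl_deferring_failures(urls, fetch, max_attempts=5):
    """Try each URL, pushing failures to the back of the queue.

    `fetch` is a hypothetical callable that takes a URL and returns
    an HTTP status code. Returns (succeeded, gave_up, order), where
    `order` records the sequence in which URLs were attempted.
    """
    pending = deque((url, 1) for url in urls)  # (url, attempt number)
    succeeded, gave_up, order = [], [], []

    while pending:
        url, attempt = pending.popleft()
        order.append(url)
        status = fetch(url)

        if status == 200:
            succeeded.append(url)
        elif attempt < max_attempts:
            # Re-queue at the back so the remaining URLs go first.
            pending.append((url, attempt + 1))
        else:
            gave_up.append(url)

    return succeeded, gave_up, order
```

In a Scrapy project the same idea would have to be wired into the framework itself, for instance through request errbacks, which this sketch does not cover.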
Solution
To avoid the 403 errors, I rotate between different user agents, like this:
import random

def get_header():
    headers_list = [
        # Firefox 77 Mac
        {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Referer": "https://www.google.com/",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1"
        },
        # Firefox 77 Windows
        {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.google.com/",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1"
        },
        # Chrome 83 Mac
        {
            "Connection": "keep-alive",
            "DNT": "1",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Dest": "document",
            "Referer": "https://www.google.com/",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
        },
        # Chrome 83 Windows
        {
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Sec-Fetch-Site": "same-origin",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-User": "?1",
            "Sec-Fetch-Dest": "document",
            "Referer": "https://www.google.com/",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9"
        }
    ]
    return random.choice(headers_list)
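Then, in your main function, call it like this:
import requests  # third-party HTTP library, separate from Scrapy

# url is the address of the page you want to fetch
header = get_header()
response = requests.get(url, headers=header)
For me, this avoided the 403 error in most cases.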
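Since the question is about Scrapy rather than requests, the same rotation idea can be expressed as a downloader middleware. The following is a minimal sketch, not a tested production setup: the class name `RandomUserAgentMiddleware` and the `USER_AGENTS` list are hypothetical, and only the User-Agent header is rotated, using the strings from the header sets above.

```python
import random

# Assumed pool of User-Agent strings, copied from the header sets above.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
]


class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: random User-Agent per request."""

    def process_request(self, request, spider):
        # Overwrite the header so each request gets a freshly picked value.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        # Returning None tells Scrapy to keep processing the request.
        return None
```

To try it, the class would be registered in the project's DOWNLOADER_MIDDLEWARES setting, for example under a path such as "myproject.middlewares.RandomUserAgentMiddleware" with a priority like 400; both the path and the priority are placeholders. Retried requests should pass through the middleware again, so each retry would get a newly picked User-Agent.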