How to reschedule 403 response pages with Scrapy?

Problem description

Sometimes I receive a 403 response when scraping pages with Scrapy 2.4.1. The retry downloader middleware is set to 5 attempts, and after the 5th attempt it does give up:

2021-02-06 01:44:17 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://www.url...> (failed 5 times): 403 Forbidden
2021-02-06 01:44:17 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.url...>: HTTP status code is not handled or not allowed
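
For context, retry behavior like this comes from settings along the following lines (a sketch; the exact values are assumed, since they are not shown above):

    # settings.py (assumed configuration, not shown in the post)
    RETRY_ENABLED = True        # on by default
    RETRY_TIMES = 4             # 1 initial attempt + 4 retries = "failed 5 times"
    RETRY_HTTP_CODES = [403]    # 403 is not retried by default, so it was added here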

However, the documentation tells me that failed pages are rescheduled at the end of the crawl, and that is not what happens. Once Scrapy gives up on a page, it never retries it:

"Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages."

https://docs.scrapy.org/en/latest/_modules/scrapy/downloadermiddlewares/retry.html

My question: how do I configure the middleware so that it does not retry these pages immediately after a failure, but instead moves on to other URLs and only reschedules the failed pages once the rest have been scraped?
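
For illustration, here is a minimal sketch of the behavior I am after (assuming 403 is taken out of RETRY_HTTP_CODES so the stock RetryMiddleware leaves those responses alone): 403 pages are re-queued at very low priority, so the scheduler only comes back to them once the regular pages are exhausted. The class name and the deferred_retries meta key are invented for the example:

    import scrapy

    class DeferredRetrySpider(scrapy.Spider):
        name = "deferred_retry"          # hypothetical spider name
        handle_httpstatus_list = [403]   # let 403 responses reach parse()

        def parse(self, response):
            if response.status == 403:
                retries = response.meta.get("deferred_retries", 0)
                if retries < 3:  # arbitrary cap for the example
                    # A very low priority pushes the request behind all regular
                    # pages in the scheduler, so it is fetched near the end.
                    yield response.request.replace(
                        dont_filter=True,
                        priority=-100,
                        meta={**response.meta, "deferred_retries": retries + 1},
                    )
                return
            # ... normal parsing of successful pages goes here ...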

Tags: python, scrapy

Solution


To avoid the 403 errors, I used different user agents (each with a matching set of browser headers), as shown below:

    import random

    def get_header():
        """Pick a random, internally consistent set of browser headers."""
        headers_list = [
            # Firefox 77 Mac
            {
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.5",
                "Referer": "https://www.google.com/",
                "DNT": "1",
                "Connection": "keep-alive",
                "Upgrade-Insecure-Requests": "1"
            },
            # Firefox 77 Windows
            {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                "Accept-Language": "en-US,en;q=0.5",
                "Accept-Encoding": "gzip, deflate, br",
                "Referer": "https://www.google.com/",
                "DNT": "1",
                "Connection": "keep-alive",
                "Upgrade-Insecure-Requests": "1"
            },
            # Chrome 83 Mac
            {
                "Connection": "keep-alive",
                "DNT": "1",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                "Sec-Fetch-Site": "none",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-Dest": "document",
                "Referer": "https://www.google.com/",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8"
            },
            # Chrome 83 Windows
            {
                "Connection": "keep-alive",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
                "Sec-Fetch-Site": "same-origin",
                "Sec-Fetch-Mode": "navigate",
                "Sec-Fetch-User": "?1",
                "Sec-Fetch-Dest": "document",
                "Referer": "https://www.google.com/",
                "Accept-Encoding": "gzip, deflate, br",
                "Accept-Language": "en-US,en;q=0.9"
            }
        ]
        return random.choice(headers_list)

Then, in your main function, call it like this:

    import requests

    header = get_header()
    response = requests.get(url, headers=header)

For me, this avoided the 403 errors in most cases.
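
If you are in Scrapy rather than requests, the same rotation can be applied per request in a downloader middleware. A minimal sketch, where the class name is a placeholder and get_header() is the function above:

    class RandomHeadersMiddleware:
        def process_request(self, request, spider):
            # Overwrite the outgoing headers with a randomly chosen,
            # internally consistent browser profile.
            for name, value in get_header().items():
                request.headers[name] = value
            return None  # let the request continue through the chain

Enable it in settings.py (the module path myproject.middlewares is a placeholder):

    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RandomHeadersMiddleware": 400,
    }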

