Amazon blocking Python 3 scraping with bs4, requests

Problem description

A few days ago, when I ran this code, it worked fine:

from bs4 import BeautifulSoup
import datetime
import requests

def getWeekMostRead(date):
    nonfiction_page = requests.get("https://www.amazon.com/charts/"+date.isoformat()+"/mostread/nonfiction")
    content = "amazon"+date.isoformat()+"_nonfiction.html"
    with open(content, "w", encoding="utf-8") as nf_file:
        nf_file.write(nonfiction_page.text)

    mostRead_nonfiction = BeautifulSoup(nonfiction_page.content, features="html.parser")

    nonfiction = mostRead_nonfiction.find_all("div", class_="kc-horizontal-rank-card")

    mostread = []
    for books in nonfiction:
        if books.find(class_="kc-rank-card-publisher") is None:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                "",
                books.find(class_="numeric-star-data").small.string.strip()
            ))
        else:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                books.find(class_="kc-rank-card-publisher").string.strip(),
                books.find(class_="numeric-star-data").small.string.strip()
            ))
    return mostread

mostread = []
date = datetime.date(2020,1,1)
while date >= datetime.date(2015,1,1):
    print("Scraped data from "+date.isoformat())
    mostread.extend(getWeekMostRead(date))
    date -= datetime.timedelta(7)
print("Currently saving scraped data to AmazonCharts.csv")
with open("AmazonCharts.csv", "w") as csv:
    counter = 0
    print("ID,Title,Author,Publisher,Rating", file=csv)
    for book in mostread:
        counter += 1
        print('AmazonCharts'+str(counter)+',"'+book[0]+'","'+book[1]+'","'+book[2]+'","'+book[3]+'"', file=csv)
    csv.close()

For some reason, when I tried to run it again today, the returned HTML file contained this:

To discuss automated access to Amazon data please contact api-services-support@amazon.com.

For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.

I know Amazon is heavily anti-scraping (or at least that's what I've read in some replies and threads). I tried using headers and delays in the code, but it didn't work. Is there another way to attempt this? Or, if I should just wait, how long should I wait?

Tags: python, python-3.x, web-scraping, beautifulsoup

Solution


As you have pointed out, Amazon is heavily anti-scraping. Entire industries have been built around scraping data from Amazon, and Amazon sells its own API access, so it is in their best interest to stop people from freely scraping data off their pages.

Based on your code, I suspect you made too many requests too quickly and were IP banned. When scraping sites, it's usually best to scrape responsibly by not going too fast, rotating user agents, and rotating IPs through a proxy service.
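For illustration, here is a minimal sketch of the first two ideas with requests: picking a browser-like User-Agent at random and routing the request through a proxy. The user-agent strings and the proxy URL below are placeholders, not working values; a rotating-proxy service would supply real endpoints.

import random
import requests

# Placeholder pool of browser-like user agents -- swap in real, current strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]

# Placeholder proxy endpoint -- replace with credentials from your proxy provider,
# or drop the proxies= argument entirely if you are not using one.
PROXIES = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

def fetch(url):
    # Send a randomly chosen User-Agent header and route through the proxy.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, proxies=PROXIES, timeout=30)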

To seem less programmatic, you can also randomize the timing between requests so the traffic looks more human.
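A sketch of that idea, reusing the hypothetical fetch() above and sleeping a random interval between requests:

import random
import time

def fetch_all(urls):
    pages = []
    for url in urls:
        pages.append(fetch(url))  # fetch() as sketched above
        # Wait a random 5-15 seconds so requests are not evenly spaced.
        time.sleep(random.uniform(5, 15))
    return pages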

Even with all of that, you'll still likely hit issues with this. Amazon is not an easy site to reliably scrape.

