python - Amazon blocked Python 3 scraping using bs4, requests
Problem description
A few days ago, when I ran this code, it worked fine:
from bs4 import BeautifulSoup
import datetime
import requests

def getWeekMostRead(date):
    nonfiction_page = requests.get("https://www.amazon.com/charts/"+date.isoformat()+"/mostread/nonfiction")
    content = "amazon"+date.isoformat()+"_nonfiction.html"
    with open(content, "w", encoding="utf-8") as nf_file:
        print(nonfiction_page.content, file=nf_file)
    mostRead_nonfiction = BeautifulSoup(nonfiction_page.content, features="html.parser")
    nonfiction = mostRead_nonfiction.find_all("div", class_="kc-horizontal-rank-card")
    mostread = []
    for books in nonfiction:
        if books.find(class_="kc-rank-card-publisher") is None:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                "",
                books.find(class_="numeric-star-data").small.string.strip()
            ))
        else:
            mostread.append((
                books.find(class_="kc-rank-card-title").string.strip(),
                books.find(class_="kc-rank-card-author").string.strip(),
                books.find(class_="kc-rank-card-publisher").string.strip(),
                books.find(class_="numeric-star-data").small.string.strip()
            ))
    return mostread

mostread = []
date = datetime.date(2020,1,1)
while date >= datetime.date(2015,1,1):
    print("Scraped data from "+date.isoformat())
    mostread.extend(getWeekMostRead(date))
    date -= datetime.timedelta(7)
print("Currently saving scraped data to AmazonCharts.csv")
with open("AmazonCharts.csv", "w") as csv:
    counter = 0
    print("ID,Title,Author,Publisher,Rating", file=csv)
    for book in mostread:
        counter += 1
        print('AmazonCharts'+str(counter)+',"'+book[0]+'","'+book[1]+'","'+book[2]+'","'+book[3]+'"', file=csv)
    csv.close()
For some reason, when I tried to run it again today, the returned HTML file contained this:
To discuss automated access to Amazon data please contact api-services-support@amazon.com.\r\n\r\nFor information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_5_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_5_ac for advertising use cases.
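I know Amazon is heavily anti-scraping (or at least that is what I have gathered from some replies and threads). I tried using headers and delays in the code, but it did not work. Is there another way to approach this? Or, if I should wait, how long should I wait?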
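Solution
As you noted, Amazon is strongly opposed to scraping. A whole industry is built around scraping data from Amazon, and Amazon has its own API access to sell, so it is in their best interest to stop people from freely scraping data from their pages.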
Based on your code, I suspect you made too many requests too quickly and were IP banned. When scraping sites, it's usually best to scrape responsibly by not going too fast, rotating user agents, and rotating IPs through a proxy service.
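As a rough illustration of rotating user agents and proxies, the sketch below cycles through two small pools and builds the keyword arguments for each request. The User-Agent strings, the proxy URLs, and the helper name `next_request_kwargs` are all hypothetical placeholders, not values known to work against Amazon.

```python
import itertools

# Hypothetical pool of User-Agent strings; substitute real, current ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]

# Hypothetical proxy endpoints; a proxy service would supply its own.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

_ua_cycle = itertools.cycle(USER_AGENTS)
_proxy_cycle = itertools.cycle(PROXIES)


def next_request_kwargs():
    """Return keyword arguments for requests.get using the next
    User-Agent and the next proxy from the pools above."""
    proxy = next(_proxy_cycle)
    return {
        "headers": {"User-Agent": next(_ua_cycle)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 10,  # seconds; an arbitrary illustrative value
    }
```

In getWeekMostRead, the call would then look like `requests.get(url, **next_request_kwargs())`, so that consecutive requests do not all share one header and one IP address.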
You can also randomize the timing between requests so the traffic looks less programmatic and more like a human visitor.
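A minimal sketch of randomized timing, assuming a base wait plus a random extra between requests. The function name `polite_sleep` and the default values are illustrative assumptions; nothing here is a threshold Amazon is known to accept.

```python
import random
import time


def polite_sleep(base=5.0, jitter=3.0, sleep=time.sleep):
    """Pause for `base` seconds plus a random extra of up to `jitter`
    seconds, and return the delay that was used.

    `sleep` is injectable so the function can be tried without waiting.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` after each `getWeekMostRead(date)` in the `while` loop would space the weekly requests unevenly instead of firing them back to back.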
Even with all of that, you'll still likely hit issues with this. Amazon is not an easy site to reliably scrape.
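Since blocks can still happen, it may help to detect the block page and stop instead of saving it as data. The sketch below checks for the opening phrase of the message quoted in the question and produces a growing series of wait times. The names `looks_blocked` and `backoff_delays` and the delay values are assumptions for illustration; how long a block actually lasts is not known.

```python
# Opening phrase of the block message quoted in the question.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def looks_blocked(html):
    """Return True if the page text contains the block message."""
    return BLOCK_MARKER in html


def backoff_delays(initial=60.0, factor=2.0, max_tries=5):
    """Yield wait times in seconds that grow by `factor` on each retry."""
    delay = initial
    for _ in range(max_tries):
        yield delay
        delay *= factor
```

A caller could test `looks_blocked(nonfiction_page.text)` after each request and, on a hit, wait for the next value from `backoff_delays()` before retrying, giving up once the generator is exhausted.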