scrapy - Amazon reviews: list index out of range
Problem description
I want to scrape the customer reviews of the Amazon Kindle Paperwhite.
I know that although Amazon may say there are 5,900 reviews, only 5,000 of them are accessible. (No reviews are shown after page=500, at 10 reviews per page.)
For the first few pages my spider returns 10 reviews per page, but later that shrinks to only one or two. That leaves me with only about 1,300 reviews. Adding the data for the variables "helpful" and "verified" seems to be the problem; both throw the following error:
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Any help would be greatly appreciated!
I tried to implement an if statement for the case where the variable is empty or contains a list, but it didn't work.
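For what it's worth, the kind of guard I was trying to write would look something like the sketch below (the `safe_get` helper and the `"Na"` default are just placeholders of mine, not part of my spider): it falls back to a default whenever the index runs past the end of the shorter list.

```python
def safe_get(items, index, default="Na"):
    """Return items[index], or a default when the list is too short."""
    return items[index] if index < len(items) else default

# 'helpful' ends up shorter than the other per-review lists
helpful = ["3 people found this helpful"]
print(safe_get(helpful, 0))  # -> '3 people found this helpful'
print(safe_get(helpful, 1))  # -> 'Na'
```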
My spider amazon_reviews.py:
import scrapy
from scrapy.extensions.throttle import AutoThrottle

class AmazonReviewsSpider(scrapy.Spider):
    name = 'amazon_reviews'
    allowed_domains = ['amazon.com']
    myBaseUrl = "https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber="
    start_urls = []

    # Creating the list of urls to be scraped by appending the page number at the end of the base url
    for i in range(1, 550):
        start_urls.append(myBaseUrl + str(i))

    def parse(self, response):
        data = response.css('#cm_cr-review_list')
        # Collecting various data
        star_rating = data.css('.review-rating')
        title = data.css('.review-title')
        text = data.css('.review-text')
        date = data.css('.review-date')
        # Number of people who found the review helpful.
        helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
        verified = response.xpath('.//span[@data-hook="avp-badge"]//text()').extract()
        # I scrape more information, but deleted it here so the code doesn't get too long

        # Yielding the scraped results
        count = 0
        for review in star_rating:
            yield {'ASIN': 'B07CXG6C9W',
                   #'ID': ''.join(id.xpath('.//text()').extract()),
                   'stars': ''.join(review.xpath('.//text()').extract_first()),
                   'title': ''.join(title[count].xpath(".//text()").extract_first()),
                   'text': ''.join(text[count].xpath(".//text()").extract_first()),
                   'date': ''.join(date[count].xpath(".//text()").extract_first()),
                   ### There seems to be a problem with adding these two, as I get 5000 reviews back if I delete them. ###
                   'verified purchase': ''.join(verified[count]),
                   'helpful': ''.join(helpful[count])
                   }
            count = count + 1
My settings.py:
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_TIMEOUT = 180
REDIRECT_ENABLED = False
#DOWNLOAD_DELAY =5.0
RANDOMIZE_DOWNLOAD_DELAY = True
The extraction itself works fine. The reviews I do get have complete and accurate information; I simply get too few of them.
When I run the spider with the following command:
scrapy runspider amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py -o reviews.csv
the console output looks like this:
2019-04-22 11:54:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=164> (referer: None)
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'BRANDI', 'title': 'Bookworms rejoice!', 'text': "The (...) 5 STARS! ", 'date': 'December 7, 2018'}
2019-04-22 11:54:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161>
{'ASIN': 'B07CXG6C9W', 'stars': '5.0 out of 5 stars', 'username': 'Doug Stender', 'title': 'As good as adverised', 'text': 'I read (...) mazon...', 'date': 'January 8, 2019'}
2019-04-22 11:54:41 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.amazon.com/Kindle-Paperwhite-Waterproof-Storage-Special/product-reviews/B07CXG6C9W/ref=cm_cr_dp_d_show_all_top?ie=UTF8&reviewerType=all_reviews&pageNumber=161> (referer: None)
Traceback (most recent call last):
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
for x in result:
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\Anaconda3\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\John\OneDrive\Dokumente\Uni\05_SS 19\Masterarbeit\Code\Scrapy\amazon_reviews_scraping_test\amazon_reviews_scraping_test\spiders\amazon_reviews.py", line 78, in parse
'helpful': ''.join(helpful[count]),
IndexError: list index out of range
Solution
It turns out that if a review has no "verified" badge, or nobody has marked it as helpful, the HTML element scrapy is looking for simply doesn't exist, so no item is appended to the list. That makes the "verified" and "helpful" lists shorter than the others, and because of the resulting error all items from that page were dropped and never written to my csv file. The simple fix below, which checks whether the list is as long as the others, works fine :)
Edit: With this fix, values may get assigned to the wrong reviews, because it always appends to the end of the list. If you want to be on the safe side, either don't scrape the verified/helpful badges at all, or replace the whole list with "Na" or something else that indicates the value is unclear.
helpful = response.xpath('.//span[@data-hook="helpful-vote-statement"]//text()').extract()
while len(helpful) != len(date):
    helpful.append("0 people found this helpful")