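selenium - Scrapy/Selenium: Sending more than 3 failed requests before the script stops
Problem description
I am currently trying to scrape a website (about 500 subpages).
The script runs very smoothly. However, after running for 3 to 4 hours, I sometimes get the error messages shown below. I think the problem is not the script but the website's server, which is slow.
My question is: is it possible to send more than 3 "failed requests" before the script automatically stops and closes the spider?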
2019-09-27 10:53:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 1 pages/min), scraped 4480 items (at 10 items/min)
2019-09-27 10:54:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 1 times): 504 Gateway Time-out
2019-09-27 10:54:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:55:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 2 times): 504 Gateway Time-out
2019-09-27 10:55:46 [scrapy.extensions.logstats] INFO: Crawled 448 pages (at 0 pages/min), scraped 4480 items (at 0 items/min)
2019-09-27 10:56:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (failed 3 times): 504 Gateway Time-out
2019-09-27 10:56:00 [scrapy.core.engine] DEBUG: Crawled (504) <GET https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480> (referer: https://blogabet.com/tipsters) ['partial']
2019-09-27 10:56:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <504 https://blogabet.com/tipsters/?f[language]=all&f[pickType]=all&f[sport]=all&f[sportPercent]=&f[leagues]=all&f[picksOver]=0&f[lastActive]=12&f[bookiesUsed]=null&f[bookiePercent]=&f[order]=followers&f[start]=4480>: HTTP status code is not handled or not allowed
2019-09-27 10:56:00 [scrapy.core.engine] INFO: Closing spider (finished)
Updated code added
from time import sleep

from scrapy import Spider
from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class AlltipsSpider(Spider):
    name = 'alltips'
    allowed_domains = ['blogabet.com']
    start_urls = ('https://blogabet.com',)

    def parse(self, response):
        # Raw string so the backslashes in the Windows path are kept literally.
        self.driver = webdriver.Chrome(r'C:\webdrivers\chromedriver.exe')
        with open("urls.txt", "rt") as f:
            start_urls = [url.strip() for url in f.readlines()]
        for url in start_urls:
            self.driver.get(url)
            self.driver.find_element_by_id('currentTab').click()
            sleep(3)
            self.logger.info('Sleeping for 3 sec.')
            self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
            sleep(7)
            self.logger.info('Sleeping for 7 sec.')
            # Keep clicking the "load more" element until it no longer exists.
            while True:
                try:
                    element = self.driver.find_element_by_id('last_item')
                    self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
                    sleep(3)
                    self.driver.find_element_by_id('last_item').click()
                    sleep(7)
                except NoSuchElementException:
                    # Everything is loaded: parse the full page source.
                    self.logger.info('No more tipps')
                    sel = Selector(text=self.driver.page_source)
                    allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
                    for post in allposts:
                        username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
                        publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
                        yield {'Username': username,
                               'Publish date': publish_date}
                    # Note: quitting here ends the browser session, so any
                    # later URL in the outer loop would need a new driver.
                    self.driver.quit()
                    break
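Solution
You just need to change the RETRY_TIMES setting to a higher number. Scrapy's default is 2 retries on top of the first attempt, which matches the "failed 3 times" in the log above.
You can read about the retry-related options in the RetryMiddleware documentation: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#std:setting-RETRY_TIMES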
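As a minimal sketch of that change, the lines below could go into the project's settings.py (the same keys can also be placed in a `custom_settings` dict on the spider). The value 10 is only an illustrative assumption, not a recommendation from the answer; pick whatever suits the server. The status code list shown is what I believe to be Scrapy's default and is repeated here only to make explicit that 504 is among the retried codes.

```python
# settings.py (sketch; the value 10 is an arbitrary example)

# Retrying is on by default; stated explicitly for clarity.
RETRY_ENABLED = True

# Number of retries in addition to the first attempt.
# The default of 2 gives 3 attempts in total, as seen in the log.
RETRY_TIMES = 10

# HTTP status codes that trigger a retry; 504 Gateway Time-out is included.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```

With these values a request that keeps returning 504 would be attempted up to 11 times before the retry middleware gives up on it.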