python - Scrapy times out after a while
Problem description
I am working on scraping text from https://www.dailynews.co.th, and here is my problem.
My spider works almost perfectly at first, crawling about 4,000 pages.
2018-09-28 20:05:00 [scrapy.extensions.logstats] INFO: Crawled 4161 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Then it starts raising a flood of TimeoutErrors from almost every URL, like this:
2018-09-28 20:06:06 [scrapy.core.scraper] ERROR: Error downloading <GET https://www.dailynews.co.th/tags/When%20Will%20You%20Marry>
Traceback (most recent call last):
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 1416, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/python/failure.py", line 491, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request,spider=spider)))
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/Twisted-18.7.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/app/.local/share/virtualenvs/monolingual-6kEg5ui2/lib/python2.7/site-packages/scrapy/core/downloader/handlers/http11.py", line 351, in _cb_timeout
    raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
TimeoutError: User timeout caused connection failure: Getting https://www.dailynews.co.th/tags/When%20Will%20You%20Marry took longer than 5.0 seconds..
This is my second attempt: I reduced CONCURRENT_REQUESTS from 32 to 16, AUTOTHROTTLE_TARGET_CONCURRENCY from 32.0 to 4.0, and DOWNLOAD_TIMEOUT from 15 to 5. The problem was not solved, but I did get further than on my first attempt (from about 1,000 pages to about 4,000).
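For context, here is a rough sketch (my own reading of the Scrapy docs, not code from this post) of how AutoThrottle adjusts the per-slot download delay: the target delay is `latency / AUTOTHROTTLE_TARGET_CONCURRENCY`, and the next delay is the average of the previous delay and that target, clamped to the configured bounds:

```python
# Simplified sketch of Scrapy's AutoThrottle adjustment rule, as described in
# the docs (not the actual scrapy.extensions.throttle source).

def next_delay(prev_delay, latency, target_concurrency, min_delay, max_delay):
    """Return the next download delay for a slot after one response."""
    target = latency / target_concurrency       # delay that would hit the target concurrency
    new = (prev_delay + target) / 2.0           # smooth: average with the previous delay
    return max(min_delay, min(new, max_delay))  # clamp to [min_delay, max_delay]

# With the settings from this post (START_DELAY=0.1, TARGET_CONCURRENCY=4.0,
# MAX_DELAY=10), a 2-second response latency pushes the delay towards 0.5 s:
print(next_delay(0.1, 2.0, 4.0, 0.1, 10))  # -> 0.3
```

Note that with DOWNLOAD_TIMEOUT = 5, any response slower than 5 seconds never completes at all: it becomes one of the TimeoutErrors above rather than a latency sample that could slow the throttle down.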
I also tried the failing URLs in scrapy shell (while my spider was still running) and got 200 responses, which suggests the connection itself is fine.
I wonder whether I have been banned or something else is going on. Can anyone give me a clue? Thanks a lot.
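On the banned-or-not question, here is a small heuristic sketch (entirely my own, with arbitrary thresholds) for interpreting the recent outcomes a crawl has seen: a wall of 403/429 responses points at a ban or rate limit, while timeouts mixed with occasional 200s (as here, where scrapy shell still got a 200) point more at throttling or connection starvation:

```python
# Heuristic, illustrative diagnosis of recent crawl outcomes.  "timeout" here
# stands for a request that raised TimeoutError; integers are HTTP statuses.

def diagnose(outcomes):
    """Guess why a crawl is failing, from a list like [200, 'timeout', 403]."""
    total = len(outcomes)
    blocked = sum(1 for o in outcomes if o in (403, 429))
    timeouts = sum(1 for o in outcomes if o == 'timeout')
    if total and blocked / float(total) > 0.5:
        return 'likely banned or rate-limited'
    if total and timeouts / float(total) > 0.5:
        return 'likely throttled or connection-starved'
    return 'inconclusive'

print(diagnose(['timeout'] * 9 + [200]))  # -> likely throttled or connection-starved
```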
FYI, here is my settings file.
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'protocol.middlewares.RotateUserAgentMiddleware': 110,
    'protocol.middlewares.MaximumAbsoluteDepthFilterMiddleware': 80,
    'protocol.middlewares.ProxyMiddleware': 543,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 0.1
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 4.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False
DOWNLOAD_TIMEOUT = 5
DEPTH_LIMIT = 100
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
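As an aside, one direction I would consider is the opposite of what I tried: a hypothetical settings fragment (values are illustrative guesses, not from this post) that gives slow responses more room and relies on Scrapy's built-in RetryMiddleware, which retries twisted TimeoutError by default, instead of failing them outright:

```python
# Hypothetical alternative settings (illustrative values):
# a 5-second DOWNLOAD_TIMEOUT turns every slow response into a hard failure,
# so raise the timeout and let RetryMiddleware re-attempt timed-out requests.
DOWNLOAD_TIMEOUT = 30                # give slow pages a chance to finish
RETRY_ENABLED = True                 # RetryMiddleware is on by default
RETRY_TIMES = 3                      # extra attempts per failing request
CONCURRENT_REQUESTS_PER_DOMAIN = 4   # match AUTOTHROTTLE_TARGET_CONCURRENCY
```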
Here is my spider code.
# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from parse_config import parse_config


class ProtocolSpider(CrawlSpider):
    name = 'protocol'
    start_urls = ['https://www.dailynews.co.th']
    custom_settings = {
        'JOBDIR': 'crawl_job'
    }

    def __init__(self, **kwargs):
        super(ProtocolSpider, self).__init__(**kwargs)
        self.arg_dict = parse_config(kwargs)
        self.start_urls = self.arg_dict['start_urls']
        # print self.start_urls
        self.allowed_domains = self.arg_dict['allowed_domains']
        self.output_file = open(self.arg_dict['output_file'], 'ab')
        self.rules = (
            Rule(LinkExtractor(allow=self.arg_dict['allow_url'], deny=self.arg_dict['deny_url']),
                 callback="parse_all", follow=True),
        )
        self._compile_rules()
        self.use_web_proxy = self.arg_dict['use_web_proxy']

    def parse(self, response):
        self.parse_all(response)
        return super(ProtocolSpider, self).parse(response)

    def parse_all(self, response):
        self._record_url(response)
        self._extract_all_p(response)
        self._extract_all_div(response)

    def _record_url(self, response):
        self.output_file.write('url_marker: %s' % response.url + '\n')

    def _extract_all_p(self, response):
        if self.arg_dict['extract_all_p']:
            p_ls = response.xpath('//p/text()').extract()
            p_string = '\n'.join([p.strip().encode('utf8') for p in p_ls if p.strip()])
            self.output_file.write(p_string + '\n')

    def _extract_all_div(self, response):
        if self.arg_dict['extract_all_div']:
            div_ls = response.xpath('//div/text()').extract()
            div_string = '\n'.join([div.strip().encode('utf8') for div in div_ls if div.strip()])
            self.output_file.write(div_string + '\n')

    def close(self, spider, reason):
        self.output_file.close()
        return super(ProtocolSpider, self).close(spider, reason)