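python - Scrapy - Getting 504 Gateway Time-out after the first requests
Question
I use Scrapy to crawl our content, and I am now trying to integrate it with Splash so that the pages' JavaScript is executed. The problem is that when I start the crawler, roughly the first 20 requests return empty content and all the remaining requests return a 504 status code. Why does this happen?
Here is the log file: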
2018-06-20 10:43:14 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:14.875348', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/camisa-compressao-adams-termica-ml-821229.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:14 [centauro] WARNING: Not valid item dropped! https://www.centauro.com.br/camisa-do-brasil-i-2018-nike-masculina-918516.html
2018-06-20 10:43:14 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:14.940786', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/camisa-do-brasil-i-2018-nike-masculina-918516.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html via http://0.0.0.0:8050/render.html> (referer: None)
2018-06-20 10:43:15 [centauro] WARNING: Not valid item dropped! https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html
2018-06-20 10:43:15 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:15.298537', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/tenis-oxer-netuno-masculino-913399.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/calca-termica-kappa-belquior-masculina-910118.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/jaqueta-oxer-water-repelent-feminina-858050.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/camiseta-do-brasil-2018-crest-nike-masculina-918483.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/tenis-fila-infinity-m00kil-mktp.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
This is the main method where my spider starts scraping:
def start_requests(self):
    mode = self.settings.get('MODE')
    urls = util.get_urls_db(self.custom_settings['URLS_COLLECTION_NAME'])
    urls = list(urls)
    if mode == 'all':
        for url in urls:
            yield SplashRequest(url['url'], self.parse_item,
                args={
                    # optional; parameters passed to Splash HTTP API
                    'timeout': 10,
                    # 'url' is prefilled from request url
                    # 'http_method' is set to 'POST' for POST requests
                    # 'body' is set to request body for POST requests
                }
            )
This is my settings.py:
SPLASH_URL = 'http://0.0.0.0:8050'
DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'updater.middlewares.SeleniumMiddleware': 700,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Answer
Try using <real_ip>:8050 instead of 0.0.0.0:8050.
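As a rough sketch of that suggestion in settings.py: 0.0.0.0 is the wildcard address a server binds to, not a reliable address for a client to connect to, so the setting can be checked before Scrapy uses it. The helper name resolve_splash_url, the use of an environment variable named SPLASH_URL, and the 127.0.0.1 default (which assumes Splash runs on the same machine) are all illustrative assumptions, not part of Scrapy or scrapy-splash.

```python
import os
from urllib.parse import urlsplit


def resolve_splash_url(env=None, default="http://127.0.0.1:8050"):
    """Pick the Splash endpoint, rejecting the wildcard address 0.0.0.0.

    `env` is a mapping such as os.environ; the SPLASH_URL variable name and
    the default are assumptions made for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("SPLASH_URL", default)
    if urlsplit(url).hostname == "0.0.0.0":
        raise ValueError(
            "SPLASH_URL points at 0.0.0.0; use the real IP or hostname "
            "of the machine running Splash"
        )
    return url


# In settings.py the result would be assigned to the existing setting:
# SPLASH_URL = resolve_splash_url()
```

With this in place, a misconfigured address fails loudly at startup instead of surfacing later as empty responses or 504 errors.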