python - Error when trying to scrape a JS page with Scrapy and Splash
Problem description
However, I keep running into this problem in the shell.
2018-09-13 14:50:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-13 14:50:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6028
2018-09-13 14:50:37 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-09-13 14:50:38 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://localhost:8050/robots.txt> (referer: None)
2018-09-13 14:51:10 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-09-13 14:51:36 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2018-09-13 14:51:40 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 2 times): 504 Gateway Time-out
2018-09-13 14:52:00 [scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (failed 3 times): 502 Bad Gateway
2018-09-13 14:52:00 [scrapy.core.engine] DEBUG: Crawled (502) <GET http://quotes.toscrape.com/js/ via http://localhost:8050/render.html> (referer: None)
2018-09-13 14:52:00 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <502 http://quotes.toscrape.com/js/>: HTTP status code is not handled or not allowed
This is my code:
import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = "jsscraper"
    start_urls = ["http://quotes.toscrape.com/js/"]

    def start_requests(self):
        # Route every start URL through Splash so the page's JavaScript is rendered.
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, endpoint='render.html')

    def parse(self, response):
        # Each quote on the rendered page sits in its own div.quote element.
        for quote in response.css("div.quote"):
            scraped_info = {
                'authorname': quote.css('small.author::text').extract_first(),
                'quote': quote.css('span.text::text').extract_first(),
            }
            yield scraped_info
I have installed scrapy-splash and added the required settings to settings.py. My Splash server is also running at http://localhost:8050/.
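The question does not show the actual settings.py, so the following is only a sketch of what those settings usually look like, based on the scrapy-splash README. The SPLASH_URL value is taken from the address mentioned above; everything else is an assumption about this project and should be compared against the real file.

```python
# settings.py (sketch based on the scrapy-splash README; not the asker's actual file)

# Address of the running Splash instance, as stated in the question.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares that route SplashRequest objects through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that avoids storing duplicate Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Splash-aware duplicate filter and HTTP cache storage.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
HTTPCACHE_STORAGE = "scrapy_splash.SplashAwareFSCacheStorage"
```

If these entries are already present, the 504 and 502 responses in the log are more likely to come from Splash itself than from the Scrapy configuration.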
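Also, when I try to render any URL on the Splash server, I get another error:
HTTP Error 400 (Bad Request) Type: ScriptError -> LUA_ERROR Error happened while executing Lua script
Lua error: [string "function main(splash, args)..."]:2: network3
I am using:
Splash version: 3.2
Lua 5.2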
Solution
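Because the Splash web UI fails on every URL, the problem can probably be reproduced without Scrapy. The sketch below calls Splash's render.html endpoint directly using only the standard library. The helper names build_render_url and check_splash are hypothetical, and the wait and timeout values are illustrative. Splash's documentation appears to describe network<N> errors as Qt network error codes, with network3 given as a DNS error, so it may be worth checking whether the Splash process (for example, its Docker container) can resolve host names at all.

```python
import urllib.error
import urllib.parse
import urllib.request


def build_render_url(splash_base, target_url, wait=0.5, timeout=30):
    """Build a render.html URL for a Splash instance (hypothetical helper)."""
    query = urllib.parse.urlencode(
        {"url": target_url, "wait": wait, "timeout": timeout}
    )
    return f"{splash_base.rstrip('/')}/render.html?{query}"


def check_splash(splash_base, target_url):
    """Return (HTTP status, first 200 characters of the body) from Splash."""
    request_url = build_render_url(splash_base, target_url)
    try:
        with urllib.request.urlopen(request_url, timeout=60) as resp:
            return resp.status, resp.read(200).decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Splash reports render failures as HTTP errors with a descriptive body.
        return err.code, err.read(200).decode("utf-8", "replace")


# Example call (requires a running Splash instance):
# print(check_splash("http://localhost:8050", "http://quotes.toscrape.com/js/"))
```

If this direct call also returns a 4xx or 5xx status, the spider code is unlikely to be the cause, and the Splash server's network access is the next thing to inspect.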