scrapy - Python: Recursive crawling with Scrapy-Splash in a CrawlSpider is not working
Problem description
I integrated scrapy-splash into my CrawlSpider, but it only renders the start_urls. I would like to know how to make scrapy-splash follow and render the internal links as well. I have searched the internet for a solution, but none of the ones I found seem to work.
Here is my code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy.item import Field
from urllib.parse import urlencode  # needed by process_links below
from scrapy_splash import SplashRequest

class Website(scrapy.Item):
    url = Field()
    response = Field()

class houzzspider(CrawlSpider):
    handle_httpstatus_list = [404, 500]
    name = "example"
    allowed_domains = ["localhost", "www.example.com"]
    start_urls = ["https://www.example.com/"]

    rules = (
        Rule(
            # process_value must be a callable, so it is omitted rather
            # than passed an empty string
            LinkExtractor(allow=(), deny=()),
            callback="parse_items",
            process_links="process_links",
            follow=True,
        ),
        Rule(
            LinkExtractor(allow=(), deny=()),
            follow=True,
        ),
    )

    def process_links(self, links):
        # Rewrite every extracted link so it is fetched through the
        # local Splash instance.
        for link in links:
            if "http://localhost:8050/render.html?" not in link.url:
                link.url = ("http://localhost:8050/render.html?"
                            + urlencode({'url': link.url, 'wait': 2.0}))
        return links

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_items,
                                endpoint='render.html',
                                args={'wait': 0.5})

    def parse_items(self, response):
        sites = response.selector.xpath('//html')
        items = []
        for site in sites:
            item = Website()
            item['url'] = response.url
            item['response'] = response.status
            items.append(item)
        return items
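The `process_links` hook above depends on `urlencode`, which the original snippet never imports, and it prefixes a stray `&` onto the query string. A minimal standalone sketch of the URL rewriting it attempts (the helper name `to_splash_url` is illustrative, not part of the original code):

```python
from urllib.parse import urlencode

SPLASH_RENDER = "http://localhost:8050/render.html?"  # local Splash instance

def to_splash_url(url, wait=2.0):
    """Wrap a target URL in a Splash render.html request URL."""
    if url.startswith(SPLASH_RENDER):
        return url  # already routed through Splash, leave it alone
    return SPLASH_RENDER + urlencode({"url": url, "wait": wait})

print(to_splash_url("https://www.example.com/page"))
# http://localhost:8050/render.html?url=https%3A%2F%2Fwww.example.com%2Fpage&wait=2.0
```

Note that links rewritten this way point at `localhost`, so `localhost` must stay in `allowed_domains` (as it does in the spider above) or the off-site middleware will drop every followed request.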
Solution