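python - How to use a Scrapy CrawlSpider with Splash to crawl JavaScript pages
Problem description
I'm having trouble crawling a JavaScript-rendered site with a Scrapy CrawlSpider. Scrapy seems to ignore the rules and just carries on with a normal crawl.
Is it possible to tell the spider to use Splash for crawling?
Thank you.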
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class MySpider(CrawlSpider):
    name = 'booki'
    start_urls = [
        'https://worldmap.com/listings/in/united-states/',
    ]
    rules = (
        # Follow links matching 'catalogue/category' (but not 'subsection.php').
        # No callback means follow=True by default.
        Rule(LinkExtractor(allow=(r'catalogue/category', ), deny=(r'subsection\.php', ))),
        # Parse the remaining 'catalogue' links with the spider's first_tier method.
        Rule(LinkExtractor(allow=(r'catalogue', ), deny=(r'catalogue/category', )), callback='first_tier'),
    )
    custom_settings = {
        #'DOWNLOAD_DELAY' : '2',
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'DOWNLOAD_DELAY': '8',
        'ITEM_PIPELINES': {
            'bookstoscrap.pipelines.BookstoscrapPipeline': 300,
        },
    }

    def start_requests(self):
        # Render each start URL through Splash before it reaches the callback.
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.first_tier,
                endpoint='render.html',
                args={'wait': 3.5},
            )
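Solution
Rules are only triggered if you actually reach a matching page after start_requests. You also need to define callback functions for your rules; otherwise they will try to use the default parse (which can make it look as if your rules do nothing).
To turn a rule's request into a SplashRequest, return it from the rule's process_request callback.
For example: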
class MySpider(CrawlSpider):
    # ...
    rules = (
        Rule(
            LinkExtractor(allow=(r'catalogue/category', ), deny=(r'subsection\.php', )),
            process_request='splash_request'
        ),
        Rule(
            LinkExtractor(allow=(r'catalogue', ), deny=(r'catalogue/category', )),
            callback='first_tier',
            process_request='splash_request'
        ),
    )
    # ...

    # Scrapy 2.0+ also passes the response; the default keeps older versions working.
    def splash_request(self, request, response=None):
        return SplashRequest(
            request.url,
            callback=request.callback,
            endpoint='render.html',
            args={'wait': 3.5},
            # Keep the rule bookkeeping that CrawlSpider stores in meta.
            meta=request.meta,
        )
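If the rules still seem to do nothing, it can help to check which URLs the allow/deny patterns would accept at all. The sketch below is not Scrapy code: it uses only the standard `re` module and assumes that LinkExtractor applies its patterns to absolute URLs with regex search semantics, ignoring the other filtering it does (domains, file extensions, duplicates). The `matching_rules` helper, the rule names, and the `example.com` URLs are hypothetical and exist only for illustration.

```python
import re

# Patterns copied from the two rules above. The names are illustrative labels.
RULES = [
    {"name": "category", "allow": [r"catalogue/category"], "deny": [r"subsection\.php"]},
    {"name": "item", "allow": [r"catalogue"], "deny": [r"catalogue/category"]},
]


def matching_rules(url):
    """Return the names of the rules whose allow/deny patterns accept url."""
    matched = []
    for rule in RULES:
        allowed = any(re.search(p, url) for p in rule["allow"])
        denied = any(re.search(p, url) for p in rule["deny"])
        if allowed and not denied:
            matched.append(rule["name"])
    return matched


# Hypothetical URLs, not taken from the site in the question.
print(matching_rules("https://example.com/catalogue/category/books_1/index.html"))
print(matching_rules("https://example.com/catalogue/some-item_1/index.html"))
print(matching_rules("https://example.com/listings/in/united-states/"))
```

A URL that matches no rule, like the last one, is never followed or passed to a rule callback, so an empty result for the links you expect to crawl points at the patterns rather than at Splash.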