How to Use a Scrapy Crawler and Splash to Scrape JavaScript Pages

Problem Description

I'm having trouble getting a Scrapy crawler to scrape a JavaScript-rendered website. It looks like Scrapy is ignoring the rules and just carrying on with a normal crawl.

Is it possible to instruct the spider to crawl using Splash?

Thank you.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class MySpider(CrawlSpider):
    name = 'booki'
    start_urls = [
        'https://worldmap.com/listings/in/united-states/',
    ]
    rules = (
        # Extract links matching 'catalogue/category' (but not 'subsection.php')
        # and follow links from them (no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'catalogue/category', ), deny=(r'subsection\.php', ))),

        # Extract the remaining 'catalogue' links and parse them with first_tier.
        Rule(LinkExtractor(allow=(r'catalogue', ), deny=(r'catalogue/category', )), callback='first_tier'),
    )
    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'DOWNLOAD_DELAY': 8,
        'ITEM_PIPELINES': {
            'bookstoscrap.pipelines.BookstoscrapPipeline': 300,
        },
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.first_tier,
                endpoint='render.html',
                args={'wait': 3.5},
            )

Tags: python, scrapy

Solution

SplashRequest only takes effect when you yield it yourself, as you do in start_requests. You also need to define callback functions for your rules; otherwise they will try to use the default parse (which may be why your rules appear to do nothing).
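For reference, here is a minimal sketch of such a callback. The first_tier name comes from the spider above; the CSS selector and item fields are hypothetical placeholders for whatever the rendered listing pages actually contain:

    def first_tier(self, response):
        # response is the Splash-rendered HTML, so markup built by
        # JavaScript is available to the selectors below.
        for listing in response.css('div.listing'):  # hypothetical selector
            yield {
                'title': listing.css('h3::text').get(),
                'url': response.urljoin(listing.css('a::attr(href)').get() or ''),
            }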

To turn the requests generated by your rules into SplashRequest, you have to return one from the rule's process_request callback. For example:

class MySpider(CrawlSpider):
    # ...

    rules = (
        Rule(
            LinkExtractor(allow=(r'catalogue/category', ), deny=(r'subsection\.php', )),
            process_request='splash_request'
        ),
        Rule(
            LinkExtractor(allow=(r'catalogue', ), deny=(r'catalogue/category', )),
            callback='first_tier',
            process_request='splash_request'
        ),
    )

    # ...

    def splash_request(self, request):
        return SplashRequest(
            request.url,
            callback=request.callback,
            endpoint='render.html',
            args={'wait': 3.5},
        )
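One version caveat: starting with Scrapy 2.0, a rule's process_request callable is called with the originating response as a second argument, and single-argument callables were deprecated. A version-tolerant sketch of the same helper:

    def splash_request(self, request, response=None):
        # Scrapy 2.0+ passes (request, response); older versions pass
        # only the request, which the default of None accommodates.
        return SplashRequest(
            request.url,
            callback=request.callback,
            endpoint='render.html',
            args={'wait': 3.5},
        )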
