In scrapy + selenium, how do I make a spider request wait until the previous request has finished processing?

Problem description

TL;DR

In scrapy, I want each request to wait until all of the spider's parse callbacks for the previous request have finished, so that the whole process is sequential. Like this:

Request1 -> Crawl1 -> Request2 -> Crawl2 ...

But what happens now is:

Request1 -> Request2 -> Request3 ...
            Crawl1      
                        Crawl2
                                 Crawl3 ...

Long version

I am new to web scraping with scrapy + selenium. I am trying to scrape a website that updates its content heavily with JavaScript. First, I open the website with selenium and log in. After that, I created a downloader middleware that handles requests with selenium and returns the responses. Below is the middleware's process_request implementation:

from scrapy.http import HtmlResponse
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class XYZDownloaderMiddleware:
    '''Other functions are as is. I just changed this one.'''

    def process_request(self, request, spider):
        driver = request.meta['driver']

        # We are opening a new link.
        if request.meta['load_url']:
            driver.get(request.url)
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))
        # We are clicking on an element to get new data using JavaScript.
        elif request.meta['click_bet']:
            element = request.meta['click_bet']
            element.click()
            WebDriverWait(driver, 100).until(
                EC.presence_of_element_located((By.XPATH, request.meta['wait_for_xpath'])))

        # Returning an HtmlResponse from process_request short-circuits
        # Scrapy's own downloader and hands this response to the callback.
        body = driver.page_source
        return HtmlResponse(driver.current_url, body=body, encoding="utf-8", request=request)

In the settings I also set CONCURRENT_REQUESTS = 1, so that multiple driver.get() calls are not made at once and selenium can load responses peacefully, one by one.
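
For completeness, a minimal sketch of the settings this setup implies; the module path xyz.middlewares and the priority value 543 are assumptions, not taken from the original post:

# settings.py -- sketch; the module path below is an assumption
CONCURRENT_REQUESTS = 1

DOWNLOADER_MIDDLEWARES = {
    'xyz.middlewares.XYZDownloaderMiddleware': 543,
}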

What I see happening now is that selenium opens each URL, scrapy makes selenium wait for the response to finish loading, and the middleware then returns the response correctly (going through the if request.meta['load_url'] block).

However, after I get the response, I want to use the selenium driver (in the parse(response) callback) to click on each element by yielding a request, and get the updated HTML back from the middleware (the elif request.meta['click_bet'] block).
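
For illustration, this is roughly what yielding such a click request from parse could look like; the XPath expressions, the button lookup, and the parseUpdated callback name are assumptions, not code from the original post:

import scrapy
from selenium.webdriver.common.by import By

def parse(self, response):
    # Locate the clickable element on the live page via the Selenium driver
    # (it cannot come from the Scrapy response, which is static HTML).
    # The XPath is an assumption for illustration.
    element = self.driver.find_element(By.XPATH, '//button[@class="bet"]')

    request = scrapy.Request(url=response.url, callback=self.parseUpdated,
                             dont_filter=True)  # same URL, so skip the dupe filter
    request.meta['driver'] = self.driver
    request.meta['load_url'] = False      # skip the driver.get() branch
    request.meta['click_bet'] = element   # the middleware clicks this element
    request.meta['wait_for_xpath'] = '//div[contains(@class, "updated")]'
    yield request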

A minimal version of the spider looks like this:

import scrapy


class XYZSpider(scrapy.Spider):
    def start_requests(self):
        start_urls = [
            'https://www.example.com/a',
            'https://www.example.com/b'
        ]
        self.driver = self.getSeleniumDriver()
        for url in start_urls:
            request = scrapy.Request(url=url, callback=self.parse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '/div/bla/bla'
            request.meta['click_bet'] = None
            yield request

    def parse(self, response):
        urls = response.xpath('//a/@href').getall()
        for url in urls:
            request = scrapy.Request(url=url, callback=self.rightSectionParse)
            request.meta['driver'] = self.driver
            request.meta['load_url'] = True
            request.meta['wait_for_xpath'] = '//div[contains(@class, "rightSection")]'
            request.meta['click_bet'] = None
            yield request

    def rightSectionParse(self, response):
        ...

So what happens is that scrapy does not wait for the spider to finish parsing. Scrapy gets a response, then calls the parse callback and fetches the next response in parallel. But the selenium driver needs to be used by the parse callback before the next request is processed.

I want requests to wait until the parse callback has finished.

Tags: python, selenium, selenium-webdriver, scrapy, request

Solution
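
A minimal sketch of one possible approach, not taken from the original answer: keep the pending URLs in a queue on the spider and yield the next request only at the end of the previous callback, so that downloading and parsing can never overlap. The spider name and structure below are hypothetical:

import scrapy


class SequentialSpider(scrapy.Spider):
    '''Hypothetical spider for illustration: crawls URLs strictly one at a time.'''
    name = 'sequential'

    def start_requests(self):
        # Queue of URLs still to be crawled; only one is ever in flight.
        self.pending = [
            'https://www.example.com/a',
            'https://www.example.com/b',
        ]
        yield self.next_request()

    def next_request(self):
        url = self.pending.pop(0)
        request = scrapy.Request(url=url, callback=self.parse, dont_filter=True)
        # The selenium meta keys from the question (driver, load_url,
        # wait_for_xpath, click_bet) would be set here in the same way.
        request.meta['load_url'] = True
        return request

    def parse(self, response):
        # ... do all the work that needs the Selenium driver here ...
        # Only after parsing is completely done is the next URL scheduled,
        # so Request2 cannot start before Crawl1 has finished.
        if self.pending:
            yield self.next_request()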

