How to reduce the number of Selenium webdriver instances spawned by Scrapy when running a crawl on a spider?

Problem description

When running the crawl process for any spider, Scrapy spawns a lot of Firefox instances (27 on average, varying between 19 and 30), even when the spider being run does not use Selenium.

I have already tried calling driver.quit() inside def __del__(self) in every spider that uses Selenium. The problem persists.

The Firefox instances stay open even after the crawl process has finished.

Example spider that uses Selenium:

import logging
import time
from os.path import abspath, dirname, join
import requests
import scrapy
import selenium
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.remote.remote_connection import LOGGER
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

LOGGER.setLevel(logging.ERROR)

PATH_DIR = dirname(abspath(__file__))
GECKODRIVER_PATH = abspath(join(PATH_DIR, "../../geckodriver"))
WAIT_TIME = 10

class ExampleSpider(sso_singapore.SsoSpider):

    name = "Example"

    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options, executable_path=GECKODRIVER_PATH)

    def __del__(self):
        self.driver.quit()

    def parse(self, response):

        meta = response.meta
        try:
            self.driver.get(response.url)
            body = self.driver.page_source
            try:
                # The locator is an XPath expression, so By.XPATH is
                # needed here; By.ID would never match it.
                element = WebDriverWait(self.driver, WAIT_TIME).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//select[@id="rows_sort"]/option[text()="All"]')
                    )
                )
            except TimeoutException:
                pass
            response = HtmlResponse(
                self.driver.current_url, body=body, encoding="utf-8"
            )

        except Exception as e:
            logging.error(str(e))
        finally:
            self.driver.quit()
        # Create Items based on response

    def start_requests(self):

        for url, meta in zip(urls, meta_list):
            yield scrapy.Request(url, callback=self.parse, meta=meta)


Any help is greatly appreciated.

Tags: selenium, scrapy

Solution


from scrapy import signals

class ExampleSpider(sso_singapore.SsoSpider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = Options()
        options.headless = True
        self.driver = webdriver.Firefox(options=options, executable_path="your_path")

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ExampleSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.driver.quit()

This should do the job. Note that your original spider builds the driver in the class body (`driver = webdriver.Firefox(...)`), which executes as soon as Scrapy imports the spider module; that is why Firefox instances appear even for crawls that never use Selenium. Moving driver creation into `__init__` and quitting it on the `spider_closed` signal keeps one driver per running spider and closes it when the crawl ends.
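If you want to go one step further and avoid starting a browser at all for spiders that never touch it, you can create the driver lazily. This is a minimal sketch using `functools.cached_property`; `make_driver` is a hypothetical factory that a real spider would wire to `webdriver.Firefox(...)`, and `FakeSpider` is only a stand-in to show when creation happens:

```python
from functools import cached_property

class LazyDriverMixin:
    # Sketch: the browser is only spawned on first access to
    # `self.driver`, and the cached_property caches it afterwards.
    @cached_property
    def driver(self):
        # make_driver is a hypothetical factory; in a real spider it
        # would return webdriver.Firefox(options=..., executable_path=...)
        return self.make_driver()

class FakeSpider(LazyDriverMixin):
    created = 0

    def make_driver(self):
        FakeSpider.created += 1
        return object()  # stand-in for a real webdriver instance

spider = FakeSpider()
assert FakeSpider.created == 0  # nothing spawned yet
spider.driver                   # first access creates the driver
spider.driver                   # second access hits the cache
assert FakeSpider.created == 1
```

A spider that never reads `self.driver` would then never start Firefox, while spiders that do still get exactly one instance each.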

More on Scrapy signals:

https://docs.scrapy.org/en/latest/topics/signals.html
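As a rough mental model of what `crawler.signals.connect(...)` does, here is a toy dispatcher (invented names, not Scrapy's real SignalManager): receivers register for a named signal, and the dispatcher calls every registered receiver when that signal is sent:

```python
class ToySignalManager:
    # Toy model only: Scrapy's actual signal machinery is richer,
    # but the connect/send contract is the same idea.
    def __init__(self):
        self._receivers = {}

    def connect(self, receiver, signal):
        # Register a callable to be invoked when `signal` fires.
        self._receivers.setdefault(signal, []).append(receiver)

    def send(self, signal, **kwargs):
        # Fire the signal: call every registered receiver.
        for receiver in self._receivers.get(signal, []):
            receiver(**kwargs)

spider_closed = "spider_closed"
closed = []

def quit_driver(spider):
    closed.append(spider)  # stand-in for spider.driver.quit()

signals = ToySignalManager()
signals.connect(quit_driver, signal=spider_closed)
signals.send(spider_closed, spider="example")
print(closed)  # prints ['example']
```

In the answer above, Scrapy plays the role of `send`: when the crawl finishes it fires `signals.spider_closed`, and your connected `spider_closed` method runs and quits the driver.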

If you have many spiders and don't want to add the same driver.quit() logic to each one, you can also use a pipeline:

class YourPipeline:

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        if hasattr(spider, 'driver'):
            spider.driver.quit()
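For the pipeline to run, it has to be enabled in your project settings. A minimal sketch, assuming the pipeline lives in `myproject/pipelines.py` (the module path and priority number are placeholders for your own project):

```python
# settings.py
ITEM_PIPELINES = {
    # Lower numbers run earlier; the exact value rarely matters
    # for a cleanup-only pipeline like this one.
    "myproject.pipelines.YourPipeline": 300,
}
```

Once enabled, every spider that exposes a `driver` attribute gets its browser quit automatically when the spider closes, with no per-spider code.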
