selenium - How to reduce the number of Selenium webdriver instances spawned when running a Scrapy crawl on a spider?
Problem description
When running a crawl for any spider, Scrapy spawns a large number of Firefox instances (27 on average, varying between 19 and 30), even when the spider being crawled does not use Selenium at all.
I have already tried calling driver.quit() inside def __del__(self) in every spider that uses Selenium, but the problem persists.
The Firefox instances also stay open even after the crawl process has finished.
Example spider that uses Selenium:
import logging
import time
from os.path import abspath, dirname, join

import requests
import scrapy
import selenium
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.remote.remote_connection import LOGGER
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

LOGGER.setLevel(logging.ERROR)

PATH_DIR = dirname(abspath(__file__))
GECKODRIVER_PATH = abspath(join(PATH_DIR, "../../geckodriver"))
WAIT_TIME = 10


class ExampleSpider(sso_singapore.SsoSpider):

    name = "Example"

    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options, executable_path=GECKODRIVER_PATH)

    def __del__(self):
        self.driver.quit()

    def parse(self, response):
        meta = response.meta
        try:
            self.driver.get(response.url)
            body = self.driver.page_source
            try:
                element = WebDriverWait(self.driver, WAIT_TIME).until(
                    EC.presence_of_element_located(
                        (By.XPATH, '//select[@id="rows_sort"]/option[text()="All"]')
                    )
                )
            except TimeoutException:
                pass
            response = HtmlResponse(
                self.driver.current_url, body=body, encoding="utf-8"
            )
        except Exception as e:
            logging.error(str(e))
        finally:
            self.driver.quit()
        # Create Items based on response

    def start_requests(self):
        for url, meta in zip(urls, meta_list):
            yield scrapy.Request(url, callback=self.parse, meta=meta)
Any help would be appreciated.
Solution
from scrapy import signals
from selenium import webdriver
from selenium.webdriver.firefox.options import Options


class ExampleSpider(sso_singapore.SsoSpider):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = Options()
        options.headless = True
        self.driver = webdriver.Firefox(options=options, executable_path="your_path")

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(ExampleSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Call spider_closed() when the spider finishes, so the driver is always quit.
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.driver.quit()
This should do the job. Creating the driver in __init__ rather than in the class body means a browser is only started for the spider instance actually being crawled (class-body code runs as soon as Scrapy's spider loader imports the module, for every spider), and the spider_closed signal fires reliably when the spider finishes, unlike __del__, which Python is not guaranteed to call.
More on Scrapy signals:
https://docs.scrapy.org/en/latest/topics/signals.html
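Scrapy also offers a shortcut for this particular signal: if a spider defines a closed() method, it is connected to spider_closed automatically, so the from_crawler wiring can be dropped. A minimal sketch, assuming the same headless-Firefox setup as above:

class ExampleSpider(scrapy.Spider):
    name = "example"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = Options()
        options.headless = True
        self.driver = webdriver.Firefox(options=options, executable_path="your_path")

    def closed(self, reason):
        # Shortcut for connecting to the spider_closed signal.
        self.driver.quit()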
If you have many spiders and do not want to add the same driver.quit() logic to each of them, you can also use a pipeline:
from scrapy import signals


class YourPipeline:

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        # Quit the driver of any spider that created one.
        if hasattr(spider, 'driver'):
            spider.driver.quit()
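The pipeline only runs if it is enabled in the project settings. A minimal sketch, assuming the class lives in your_project/pipelines.py (the module path and priority value are placeholders):

# settings.py
ITEM_PIPELINES = {
    "your_project.pipelines.YourPipeline": 300,
}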