Scrapy downloader middleware does not run `__del__` when returning a Response object

Problem description

I am writing a downloader middleware that hooks Scrapy up to Selenium:

# spider.py
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    # allowed_domains = ['xxx.com']
    start_urls = ['http://httpbin.org/']

    def parse(self, response):
        print(response)

# Middleware.py 
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from scrapy.http import HtmlResponse


class SeleniumMiddleware:
    """Docking Selenium"""

    def __init__(self):
        self.browser = webdriver.Chrome()
        self.browser.maximize_window()
        self.wait = WebDriverWait(self.browser, 10)

    @classmethod
    def from_crawler(cls, crawler):
        return cls()

    def process_request(self, request, spider):
        try:
            # Only route the start URLs through Selenium
            if request.url in spider.start_urls:
                self.browser.get(request.url)

                # wait for the data to load
                self.wait.until(EC.presence_of_element_located((
                    By.ID, 'operations-tag-HTTP_Methods')))

                page_text = self.browser.page_source  # get the rendered page

                # return Response
                print('1')
                return HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request, status=200)
        except TimeoutException:
            # Timeout
            print('2')
            return HtmlResponse(url=request.url, status=500, request=request)

    def process_exception(self, request, exception, spider):
        print(f'Error: {exception}')
        return None

    def __del__(self):
        print('Browser close~')
        self.browser.quit()
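For context on why relying on `__del__` is fragile: CPython only calls `__del__` once the last reference to the object disappears, and Scrapy's middleware manager keeps a reference to the middleware for the lifetime of the crawl. A minimal plain-Python sketch of that effect (the names `Middleware`, `registry`, and `events` are illustrative, not Scrapy APIs):

```python
events = []

class Middleware:
    def __del__(self):
        # records the moment the object is actually finalized
        events.append('closed')

registry = []        # stands in for Scrapy's middleware manager
mw = Middleware()
registry.append(mw)  # the manager holds a reference

del mw               # the local name is gone, but the registry still refers to it
assert events == []  # __del__ has NOT run yet

registry.clear()     # the reference count drops to zero only now
assert events == ['closed']
```

So whether `__del__` fires, and when, depends entirely on who else still holds a reference, which is exactly why it is a poor place for cleanup like closing a browser.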

While testing, I found that when the middleware returns a response object, `__del__` is never executed, so the opened browser is not closed as expected.

Output:

1
<200 http://httpbin.org/>

As you can see, no exception was raised, yet `__del__` was never called on the middleware.

When I comment out the code that returns the response object, `__del__` does run:

# return Response
print('1')
# return HtmlResponse(url=request.url, body=page_text, encoding='utf-8', request=request, status=200)

Output:

1
<200 http://httpbin.org/>
Browser close~

I would like to know what causes this, and how I should perform cleanup (closing the browser) when returning a response object.

Thanks in advance for any help.

Solved: add a `spider_closed` method to the middleware and connect it to the `spider_closed` signal when the middleware is created in `from_crawler` (this requires `from scrapy import signals`):

--skip--
@classmethod
def from_crawler(cls, crawler):
    o = cls()
    crawler.signals.connect(o.spider_closed, signals.spider_closed)

    return o

--skip--

def spider_closed(self):
    """Close the browser"""
    self.browser.quit()
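The key difference is that the signal fires deterministically when the spider closes, instead of whenever the garbage collector happens to finalize the middleware. The mechanism can be mimicked in plain Python; in this sketch, `FakeSignals` is a toy stand-in for `crawler.signals`, and the `spider_closed` sentinel plays the role of `scrapy.signals.spider_closed` (none of these are the real Scrapy classes):

```python
from collections import defaultdict

class FakeSignals:
    """Toy stand-in for crawler.signals (illustrative only)."""
    def __init__(self):
        self._receivers = defaultdict(list)

    def connect(self, receiver, signal):
        self._receivers[signal].append(receiver)

    def send(self, signal):
        for receiver in self._receivers[signal]:
            receiver()

spider_closed = object()  # sentinel playing the role of scrapy.signals.spider_closed

closed = []

class SeleniumMiddleware:
    @classmethod
    def from_crawler(cls, fake_signals):
        o = cls()
        # same pattern as the real middleware: register cleanup at creation time
        fake_signals.connect(o.spider_closed, spider_closed)
        return o

    def spider_closed(self):
        closed.append(True)  # in the real middleware: self.browser.quit()

sig = FakeSignals()
mw = SeleniumMiddleware.from_crawler(sig)
sig.send(spider_closed)  # Scrapy fires this signal itself at shutdown
assert closed == [True]
```

With the real `crawler.signals`, Scrapy sends `spider_closed` as part of its shutdown sequence, so the browser is quit exactly once regardless of which responses the middleware returned.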

Tags: python, scrapy
