Instantiating a database connection in a Scrapy middleware and accessing it from other modules

Problem description

I have several different spiders in one project that share the same database, and I have different item classes so that I can process them correctly in the pipelines and send them to the right destination. In my first spider, the database connection is instantiated in the pipeline, like this:

import psycopg2
from scrapy.exceptions import NotConfigured


class DatabasePipeline(object):  # the pipeline class holding the connection

    def __init__(self, database, user, password, host, port):
        self.database = database
        self.user = user
        self.password = password
        self.host = host
        self.port = port

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:
            raise NotConfigured
        db = db_settings['database']
        user = db_settings['user']
        password = db_settings['password']
        host = db_settings['host']
        port = db_settings['port']
        return cls(db, user, password, host, port)

    def open_spider(self, spider):
        self.connection = psycopg2.connect(database=self.database, user=self.user, password=self.password,
                                           host=self.host, port=self.port)
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()
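For reference, from_crawler() above reads the credentials from a DB_SETTINGS dict in settings.py; something along these lines (the concrete values are placeholders):

# settings.py -- the DB_SETTINGS dict that from_crawler() reads (values are placeholders)
DB_SETTINGS = {
    'database': 'scrapy_db',
    'user': 'scrapy_user',
    'password': 'secret',
    'host': 'localhost',
    'port': 5432,
}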

This works fine, but for my second spider I need to read some data from the database in the spider itself so that I can start crawling, and then send the items to the pipeline to save them in the database. I could reuse the same code to instantiate the connection in the spider and stop doing it in the pipeline, but with multiple spiders I don't want to repeat this over and over. I would like to know how to instantiate the database connection in a middleware and access it from both the spiders and the pipelines. I imagine I can use the same code as above to open the connection, but I don't know how to adapt it so that the cursor and the connection are accessible in the spiders and the pipelines.

Tags: python, postgresql, scrapy, web-crawler

Solution


This is how I got it to work; you can do it in a middleware:

## Middleware

from scrapy import signals
from scrapy.exceptions import NotConfigured
import psycopg2


class DBMiddleware(object):

    def __init__(self, db_settings):
        self.db_setting = db_settings

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:  # if we don't define db config in settings
            raise NotConfigured

        s = cls(db_settings)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        spider.connection = psycopg2.connect(database=self.db_setting['database'],
                                           user=self.db_setting['user'],
                                           password=self.db_setting['password'],
                                           host=self.db_setting['host'],
                                           port=self.db_setting['port'])


    def spider_closed(self, spider):
        spider.connection.close()
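Note that the middleware's from_crawler() is only called if the middleware is enabled in settings.py; the module path and priority value below are assumptions about the project layout:

# settings.py -- enable the middleware (path and priority are placeholders)
SPIDER_MIDDLEWARES = {
    'ProjectName.middlewares.DBMiddleware': 543,
}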

Then you can add this to the spider to access the connection that was just created:

## Spider
import scrapy
from scrapy import signals
from scrapy.loader import ItemLoader
from ProjectName.items import Item_profile  # assumes Item_profile is defined in the project's items.py


class MainSpider(scrapy.Spider):

    name = 'main_spider'
    start_urls = ['http://www.example.com']

    def __init__(self, *args, **kwargs):
        super(MainSpider, self).__init__(*args, **kwargs)

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # let Spider.from_crawler create and bind the spider, then hook up the signals
        s = super(MainSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        pass

    def parse(self, response):
        cursor = self.connection.cursor()
        sql = "SELECT * FROM companies"
        cursor.execute(sql)
        result = cursor.fetchall()
        for element in result:
            loader = ItemLoader(item=Item_profile(), selector=element)
            loader.add_value('name', element[0])
            items = loader.load_item()
            yield items

    def spider_closed(self, spider):
        pass
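Since the connection is attached to the spider instance, a pipeline can reuse it through the spider argument of process_item() instead of opening its own connection. A minimal sketch, assuming a companies table with a name column:

## Pipeline
class SaveCompanyPipeline(object):

    def process_item(self, item, spider):
        # reuse the connection that the middleware attached to the spider
        cursor = spider.connection.cursor()
        cursor.execute("INSERT INTO companies (name) VALUES (%s)", (item.get('name'),))
        spider.connection.commit()
        cursor.close()
        return item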

This approach works fine if you only need to access the database connection inside the spider's parse method, but what I needed was to open the connection before parse runs, so that I could first retrieve the links to crawl from the db. In other words, I needed the connection inside the #Spider spider_opened() method, but the order in which the methods are triggered is:

1: #Spider __init__()
2: #Spider spider_opened()
3: #Middleware spider_opened() -->> connection is created here
4: #Spider parse()
5: #Spider spider_closed()
6: #Middleware spider_closed()

That makes sense, because according to the docs the main purpose of a middleware is to sit between the engine and the spider. What we need instead is a component that is instantiated when Scrapy starts up, and that is an Extension. So I created a file named extensions.py in the project root, next to the middlewares, pipelines, etc., and added the same code that was in the middleware:

from scrapy import signals
from scrapy.exceptions import NotConfigured
import psycopg2


class DBExtension(object):

    def __init__(self, db_settings):
        self.db_setting = db_settings

    @classmethod
    def from_crawler(cls, crawler):
        db_settings = crawler.settings.getdict("DB_SETTINGS")
        if not db_settings:
            raise NotConfigured
        s = cls(db_settings)
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(s.spider_closed, signal=signals.spider_closed)
        return s

    def spider_opened(self, spider):
        spider.connection = psycopg2.connect(database=self.db_setting['database'],
                                           user=self.db_setting['user'],
                                           password=self.db_setting['password'],
                                           host=self.db_setting['host'],
                                           port=self.db_setting['port'])


    def spider_closed(self, spider):
        spider.connection.close()

Then I registered it in settings.py:

EXTENSIONS = {
    'ProjectName.extensions.DBExtension': 400
}

Now you can access this connection as self.connection inside the #Spider spider_opened() method, and load information from the database before crawling starts. I don't know whether there is a more elegant way to solve this, but for now it gets the job done for me.
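For illustration, the spider's spider_opened() can now pull the links out of the database and start_requests() can crawl them; a sketch under the assumption that a companies table with a url column holds the links:

# inside the MainSpider class shown above
    def spider_opened(self, spider):
        # the extension's spider_opened() has already run at this point, so the connection exists
        cursor = self.connection.cursor()
        cursor.execute("SELECT url FROM companies")  # assumed table/column names
        self.company_urls = [row[0] for row in cursor.fetchall()]
        cursor.close()

    def start_requests(self):
        # crawl the links that were just loaded from the database
        for url in self.company_urls:
            yield scrapy.Request(url, callback=self.parse)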

