Scrapy - pymongo not inserting items into the database

Problem description

So I've been playing around with Scrapy to learn it, using MongoDB as my database, and I've hit a dead end. The scraping itself works, since the items I fetch show up in the terminal log, but I can't get the data posted to my database. The MONGO_URI is correct, since I tried it in a Python shell and was able to create and store data there.
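A minimal sketch of that kind of shell check (the URI, database, and collection names here are placeholders, not the actual values):

import pymongo

# connect with the same Atlas URI used in settings.py (placeholder shown)
client = pymongo.MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
db = client["materials"]

# write one document and read it back to confirm the connection works
db["my-prices"].insert_one({"title": "test", "price": "1.00"})
print(db["my-prices"].find_one({"title": "test"}))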

Here are my files:

items.py


import scrapy

class MaterialsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
   ## url = scrapy.Field()
    pass

spider.py

import scrapy
from scrapy.selector import Selector

from ..items import MaterialsItem

class mySpider(scrapy.Spider):
    name = "<placeholder for post>"
    allowed_domains = ["..."]
    start_urls = [
   ...
    ]

    def parse(self, response):
        products = Selector(response).xpath('//div[@class="content"]')

        for product in products:        
                item = MaterialsItem()
                item['title'] = product.xpath("//a[@class='product-card__title product-card__title-v2']/text()").extract(),
                item['price'] = product.xpath("//div[@class='product-card__price-value ']/text()").extract()
               ## product['url'] = 
                yield item

settings.py

MONGO_PIPELINES = {
    'materials.pipelines.MongoPipeline': 300,
}


#setup mongo DB
MONGO_URI = "my MongoDB Atlas address"
MONGO_DB = "materials"

pipelines.py

import logging

import pymongo

class MongoPipeline(object):

    collection_name = 'my-prices'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        ## pull in information from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', '<placeholder-spider name>')

        )

    def open_spider(self, spider):
        ## initializing spider
        ## opening db connection
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        ## clean up when spider is closed
        self.client.close()

    def process_item(self, item, spider):
        ## how to handle each post
        self.db[self.collection_name].insert(dict(item))
        logging.debug("Post added to MongoDB")
        return item

Any help would be great!

Edit:

File structure

materials
  spiders
    my-spider
  items.py
  pipelines.py
  settings.py

Tags: python, scrapy, pymongo

Solution


Shouldn't this line in the MongoPipeline class:

collection_name = 'my-prices'

be:

self.collection_name = 'my-prices'

since you call:

self.db[self.collection_name].insert(dict(item))
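For reference, a minimal sketch of pipelines.py with that change applied, assuming the MONGO_URI / MONGO_DB settings shown above (the 'materials' fallback database name is an assumption for illustration). Two things worth noting alongside the sketch, as facts about the libraries rather than part of the answer above: Scrapy only activates pipelines registered under the ITEM_PIPELINES setting, and pymongo 4 removed Collection.insert in favor of insert_one.

# settings.py -- Scrapy only loads pipelines listed under ITEM_PIPELINES
ITEM_PIPELINES = {
    'materials.pipelines.MongoPipeline': 300,
}

# pipelines.py
import logging

import pymongo

class MongoPipeline:

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        # set as an instance attribute, per the answer above
        self.collection_name = 'my-prices'

    @classmethod
    def from_crawler(cls, crawler):
        # read connection details from settings.py;
        # the 'materials' default is an assumed fallback
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB', 'materials'),
        )

    def open_spider(self, spider):
        # open the connection once when the spider starts
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # clean up when the spider is closed
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces insert(), which pymongo 4 removed
        self.db[self.collection_name].insert_one(dict(item))
        logging.debug("Post added to MongoDB")
        return item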
