Writing data scraped by spiders from different websites to a CSV file (Scrapy)

Problem description

In an attempt to combine 2 different Scrapy spiders that crawl unrelated websites, I created this script. But now I can't seem to get the data into a regular csv or json file. Before I merged the spiders I would just run 'scrapy crawl afg2 -o data_set.csv', but now that no longer seems to work.

What is the easiest way to still get the data into a csv file? Here is my code:

import scrapy
from scrapy.crawler import CrawlerProcess


class KhaamaSpider1(scrapy.Spider):
    name = 'khaama1'
    allowed_domains = ['www.khaama.com/category/afghanistan']
    start_urls = ['https://www.khaama.com/category/afghanistan']

    def parse(self, response):
        container = response.xpath("//div[@class='post-area']")
        for x in container:
            doc = x.xpath(".//div[@class='blog-author']/descendant::node()[4]").get()
            title = x.xpath(".//div[@class='blog-title']/h3/a/text()").get()
            author = x.xpath(".//div[@class='blog-author']/a/text()").get()
            rel_url = x.xpath(".//div[@class='blog-title']/h3/a/@href").get()

            yield {
                'date_of_creation' : doc,
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }

class PajhwokSpider1(scrapy.Spider):
    name = 'pajhwok1'
    allowed_domains = ['www.pajhwok.com']
    start_urls = ['https://www.pajhwok.com/en/security-crime']

    def parse(self, response):
        container = response.xpath("//div[@class='node-inner clearfix']")
        for x in container:
            doc = x.xpath(".//div[@class='journlist-creation-article']/descendant::div[5]/text()").get()
            title = x.xpath(".//h2[@class='node-title']/a/text()").get()
            author = x.xpath(".//div[@class='field-item even']/a/text()").get()
            rel_url = x.xpath(".//h2[@class='node-title']/a/@href").get()

            yield {
                'date_of_creation' : doc,
                'title' : title,
                'author' : author,
                'rel_url' : rel_url
            }
        
process = CrawlerProcess()
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()

Tags: python, web-scraping, scrapy

Solution


Here is an example pipelines.py for the 2 spiders. It only closes the JSON file after the second spider has finished. You can find more information here: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import json

from itemadapter import ItemAdapter

spider_count = 2  # module-level counter shared by both crawlers

class JsonWriterPipeline:
    # Opened at class level so both crawlers (each one gets its own
    # pipeline instance) write to the same file handle.
    file = open('items.json', 'w')

    def open_spider(self, spider):
        # Nothing to do here; the file is already open.
        return None

    def close_spider(self, spider):
        # Only close the file once the last spider has finished.
        global spider_count
        spider_count -= 1
        if spider_count == 0:
            self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(ItemAdapter(item).asdict()) + "\n"
        self.file.write(line)
        return item
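
The pipeline is only used if it is enabled in the settings. Since the spiders are started from a script with CrawlerProcess rather than with scrapy crawl, the settings can be passed to CrawlerProcess directly. A minimal sketch, assuming the pipeline class lives in the same script as the two spiders (hence the '__main__.' path; in a regular project it would be something like 'myproject.pipelines.JsonWriterPipeline'):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # enable the shared pipeline for both crawlers
    'ITEM_PIPELINES': {'__main__.JsonWriterPipeline': 300},
})
process.crawl(KhaamaSpider1)
process.crawl(PajhwokSpider1)
process.start()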

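If the output really has to be CSV rather than JSON, the same idea works with csv.DictWriter. A rough adaptation, assuming both spiders keep yielding exactly the four fields shown above and that 'data_set.csv' is the desired file name:

import csv

from itemadapter import ItemAdapter

spider_count = 2  # number of spiders sharing the file

class CsvWriterPipeline:
    # Class-level state: each crawler gets its own pipeline instance,
    # so the file handle and writer are shared between the two spiders
    # the same way as in the JSON example above.
    file = open('data_set.csv', 'w', newline='')
    writer = csv.DictWriter(
        file, fieldnames=['date_of_creation', 'title', 'author', 'rel_url'])
    writer.writeheader()

    def close_spider(self, spider):
        global spider_count
        spider_count -= 1
        if spider_count == 0:
            self.file.close()

    def process_item(self, item, spider):
        self.writer.writerow(ItemAdapter(item).asdict())
        return item

This class would then be registered in ITEM_PIPELINES instead of (or alongside) the JSON pipeline.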