Scrapy's JSON output forms an array of JSON objects

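Problem description

I am trying to scrape a game information website with Scrapy. The crawl proceeds as follows: scrape the categories -> scrape the list of games (each category spans multiple pages) -> scrape the game information. The scraped data should end up in a single JSON file. I get the following result: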

[
    {"category": "cat1", "games": [...]},
    {"category": "cat2", "games": [...]},
    ...
]

But I want to get this result:

{ "categories":
    [
        {"category": "cat1", "games": [...]},
        {"category": "cat2", "games": [...]},
        ...
    ]
}

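I tried following the steps in two other posts, but without success. I could not find any other related questions.

I would appreciate any help.

My spider: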

import scrapy
from ..items import Category, Game

class GamesSpider(scrapy.Spider):
    name = 'games'
    start_urls = ['https://www.example.com/categories']
    base_url = 'https://www.example.com'

    def parse(self, response):
        categories = response.xpath("...")

        for category in categories:
            cat_name = category.xpath(".//text()").get()
            url = self.base_url + category.xpath(".//@href").get()    
            
            cat = Category()
            cat['category'] = cat_name
            
            yield response.follow(url=url, 
                                  callback=self.parse_category, 
                                  meta={ 'category': cat })

    def parse_category(self, response):
        games_url_list = response.xpath('//.../a/@href').getall()

        cat = response.meta['category']
        url = self.base_url + games_url_list.pop()
        next_page = response.xpath('//a[...]/@href').get()
        
        if next_page:
            # Reuse the value already extracted instead of querying again.
            next_page = self.base_url + next_page

        yield response.follow(url=url, 
                              callback=self.parse_game, 
                              meta={'category': cat, 
                                    'games_url_list': games_url_list, 
                                    'next_page': next_page})
            
    def parse_game(self, response):
        cat = response.meta['category']
        game = Game()

        # Create the games list the first time this category is seen.
        if 'games_list' not in cat:
            cat['games_list'] = []
        
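        # .get() extracts the first matched string; a bare SelectorList
        # is not JSON-serializable.
        game['title_en'] = response.xpath('...').get()
        game['os'] = response.xpath('...').get()
        game['users_rating'] = response.xpath('...').get()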
 
        cat['games_list'].append(game)

        games_url_list = response.meta['games_url_list']
        next_page = response.meta['next_page']
        
        if games_url_list: 
            url = self.base_url + games_url_list.pop()
            yield response.follow(url=url, 
                                  callback=self.parse_game, 
                                  meta={'category': cat, 
                                        'games_url_list': games_url_list, 
                                        'next_page': next_page})

        else:
            if next_page:
                yield response.follow(url=next_page, 
                                      callback=self.parse_category, 
                                      meta={'category': cat})
            else:
                yield cat

My items.py file:

import scrapy

class Category(scrapy.Item):
    category = scrapy.Field()
    games_list = scrapy.Field()

class Game(scrapy.Item):
    title_en = scrapy.Field()
    os = scrapy.Field()
    users_rating = scrapy.Field()

Tags: python, json, web-scraping, scrapy

Solution


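You need to either write a custom item exporter, or post-process the file that Scrapy generates as a separate step, for example with a standalone Python script that converts the output into the desired format.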
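
The post-processing route could look roughly like the sketch below. It assumes the crawl has already written a valid JSON array to a file; the function name wrap_categories and the file names in the usage note are hypothetical and not part of the original question. The script only wraps the existing array under a "categories" key and leaves the items themselves unchanged.

```python
import json
from pathlib import Path


def wrap_categories(src_path, dst_path, key="categories"):
    """Wrap the top-level JSON array written by Scrapy in an object.

    Hypothetical helper: reads the array from src_path and writes
    {key: [...]} to dst_path. Returns the wrapped object.
    """
    items = json.loads(Path(src_path).read_text(encoding="utf-8"))

    # Scrapy's JSON feed is a single array; refuse anything else so a
    # wrong input file is not silently wrapped.
    if not isinstance(items, list):
        raise ValueError(f"expected a JSON array in {src_path}")

    wrapped = {key: items}
    Path(dst_path).write_text(
        json.dumps(wrapped, ensure_ascii=False, indent=4),
        encoding="utf-8",
    )
    return wrapped
```

For example, after running something like scrapy crawl games -o games.json, calling wrap_categories("games.json", "games_wrapped.json") would produce the { "categories": [...] } layout. Note that the items in the question use the field name games_list, so the inner key would still be "games_list" rather than "games" unless the item field is renamed.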

