Saving output in JSON format

Problem description

I am trying to write my output to a JSON file. But when I check the output with a validator, it says it is not valid JSON. Can anyone help me figure out what I am doing wrong?

# -*- coding: utf-8 -*-
import scrapy
from ..items import news18Item
import re
from webpreview import web_preview
from webpreview import OpenGraph
import json

class News18SSpider(scrapy.Spider):
    name = 'news18_story'
    page_number = 2
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        items = news18Item()
        page_id = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        items['page_id'] = page_id

        story_url = page_id

        for i in story_url:
            og = OpenGraph(i, ["og:title", "og:description", "og:image", "og:url"])

            # one list of single-key dicts per story
            dictionary = [{"page_title": og.title}, {"description": og.description}, {"image_url": og.image}, {"post_url": og.url}]

            # append each list to the file on its own line
            with open("news18_new.json", "a") as outfile:
                json.dump(dictionary, outfile)
                outfile.write("\n")



        next_page = 'https://www.news18.com/movies/page-' + str(News18SSpider.page_number) + '/'
        if News18SSpider.page_number <= 20:
            News18SSpider.page_number += 1
            yield response.follow(next_page, callback=self.parse)

Tags: python, json, python-3.x, web-scraping, scrapy

Solution


Here is minimal working code.

You can put all of the code in a single file, script.py, and run it with python script.py, without creating a Scrapy project.

I yield every item as a single dictionary:

    yield {
        "page_title": og.title,
        "description": og.description,
        "image_url": og.image,
        "post_url": og.url
    }

and Scrapy saves them as a proper JSON file: one list containing many dictionaries.
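The resulting file is a single top-level JSON value, roughly like this (the field values here are placeholders):

[
    {"page_title": "...", "description": "...", "image_url": "...", "post_url": "..."},
    {"page_title": "...", "description": "...", "image_url": "...", "post_url": "..."}
]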

Your code instead writes many separate lists, one per line, which is not valid JSON.
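With your append-per-loop approach the file looks like this instead: every line is its own complete JSON array, so the file as a whole is a sequence of separate documents, and a validator rejects it:

[{"page_title": "..."}, {"description": "..."}, {"image_url": "..."}, {"post_url": "..."}]
[{"page_title": "..."}, {"description": "..."}, {"image_url": "..."}, {"post_url": "..."}]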

JSON is not a format you can append to. To add new items you would have to read the entire file into memory, append the new item to the data in memory, and then write all of the data back to the file.
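Here is a minimal sketch of what appending to a JSON file really takes; the helper name append_item and the file name items.json are made up for illustration:

import json
import os

def append_item(path, item):
    # read the whole existing list into memory (or start a new one)
    data = []
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
    # append in memory, then rewrite the entire file
    data.append(item)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

append_item("items.json", {"page_title": "example"})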

A CSV file, on the other hand, can be appended to without reading all of the data into memory.
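For comparison, a sketch of appending one row to a CSV file; the item values are placeholders, not output from the spider:

import csv

item = {"page_title": "...", "description": "...", "image_url": "...", "post_url": "..."}

with open("news18.csv", "a", newline="") as f:
    writer = csv.writer(f)
    # appending one row never requires re-reading the file
    writer.writerow([item["page_title"], item["description"], item["image_url"], item["post_url"]])

Scrapy's jsonlines feed format has the same property: each line is one standalone JSON object, so new items can simply be appended.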


import scrapy
from webpreview import OpenGraph

class News18SSpider(scrapy.Spider):

    name = 'news18_story'
    page_number = 1
    start_urls = ['https://www.news18.com/movies/page-1/']

    def parse(self, response):
        #all_hrefs = response.xpath('/html/body/div[2]/div[5]/div[2]/div[1]/div[*]/div[*]/p/a/@href').getall()
        all_hrefs = response.xpath('//div[@class="blog-list-blog"]/p/a/@href').getall()

        for href in all_hrefs:
            og = OpenGraph(href, ["og:title", "og:description", "og:image", "og:url"])

            yield {
                "page_title": og.title,
                "description": og.description,
                "image_url": og.image,
                "post_url": og.url
            } 

        if self.page_number <= 20:
            self.page_number += 1  
            next_url = 'https://www.news18.com/movies/page-{}/'.format(self.page_number)
            #yield response.follow(next_url) # , callback=self.parse)
            yield scrapy.Request(next_url)

# --- run without project and save in `output.json` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    #'USER_AGENT': 'Mozilla/5.0',
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',

    # save the feed as CSV, JSON or XML
    'FEED_FORMAT': 'json',
    'FEED_URI': 'output.json',
})

c.crawl(News18SSpider)
c.start() 
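A side note: in Scrapy 2.1 and later, FEED_FORMAT and FEED_URI are deprecated in favour of the FEEDS setting; an equivalent configuration would look like this:

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    # FEEDS maps each output URI to its export options
    'FEEDS': {
        'output.json': {'format': 'json'},
    },
})

c.crawl(News18SSpider)
c.start()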
