Parsing nested JSON into pandas DataFrames

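Problem description

I am reading data from a target legacy system that contains stock earnings data. The data is exported as JSON, in modules such as this earnings module.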

earnings_dict = {
 "earningsChart": {
      "quarterly": [
           {
                "date": "1Q2018",
                "actual": {
                     "raw": 0.12,
                     "fmt": "0.12"
                },
                "estimate": {
                     "raw": 0.05,
                     "fmt": "0.05"
                }
           },
           {
                "date": "2Q2018",
                "actual": {
                     "raw": 0.21,
                     "fmt": "0.21"
                },
                "estimate": {
                     "raw": 0.19,
                     "fmt": "0.19"
                }
           },
           {
                "date": "3Q2018",
                "actual": {
                     "raw": 0.16,
                     "fmt": "0.16"
                },
                "estimate": {
                     "raw": 0.21,
                     "fmt": "0.21"
                }
           },
           {
                "date": "4Q2018",
                "actual": {
                     "raw": 0.07,
                     "fmt": "0.07"
                },
                "estimate": {
                     "raw": 0.14,
                     "fmt": "0.14"
                }
           }
      ],
      "currentQuarterEstimate": {
           "raw": 0.15,
           "fmt": "0.15"
      },
      "currentQuarterEstimateDate": "1Q",
      "currentQuarterEstimateYear": 2019,
      "earningsDate": [
           {
                "raw": 1556496000,
                "fmt": "2019-04-29"
           },
           {
                "raw": 1556841600,
                "fmt": "2019-05-03"
           }
      ]
 },
 "financialsChart": {
      "yearly": [
           {
                "date": 2015,
                "revenue": {
                     "raw": 74977000,
                     "fmt": "74.98M",
                     "longFmt": "74,977,000"
                },
                "earnings": {
                     "raw": -15668000,
                     "fmt": "-15.67M",
                     "longFmt": "-15,668,000"
                }
           },
           {
                "date": 2016,
                "revenue": {
                     "raw": 105586000,
                     "fmt": "105.59M",
                     "longFmt": "105,586,000"
                },
                "earnings": {
                     "raw": -8281000,
                     "fmt": "-8.28M",
                     "longFmt": "-8,281,000"
                }
           },
           {
                "date": 2017,
                "revenue": {
                     "raw": 143803000,
                     "fmt": "143.8M",
                     "longFmt": "143,803,000"
                },
                "earnings": {
                     "raw": 9716000,
                     "fmt": "9.72M",
                     "longFmt": "9,716,000"
                }
           },
           {
                "date": 2018,
                "revenue": {
                     "raw": 190071000,
                     "fmt": "190.07M",
                     "longFmt": "190,071,000"
                },
                "earnings": {
                     "raw": 19967000,
                     "fmt": "19.97M",
                     "longFmt": "19,967,000"
                }
           }
      ],
      "quarterly": [
           {
                "date": "1Q2018",
                "revenue": {
                     "raw": 42340000,
                     "fmt": "42.34M",
                     "longFmt": "42,340,000"
                },
                "earnings": {
                     "raw": 4320000,
                     "fmt": "4.32M",
                     "longFmt": "4,320,000"
                }
           },
           {
                "date": "2Q2018",
                "revenue": {
                     "raw": 47240000,
                     "fmt": "47.24M",
                     "longFmt": "47,240,000"
                },
                "earnings": {
                     "raw": 7474000,
                     "fmt": "7.47M",
                     "longFmt": "7,474,000"
                }
           },
           {
                "date": "3Q2018",
                "revenue": {
                     "raw": 50126000,
                     "fmt": "50.13M",
                     "longFmt": "50,126,000"
                },
                "earnings": {
                     "raw": 5524000,
                     "fmt": "5.52M",
                     "longFmt": "5,524,000"
                }
           },
           {
                "date": "4Q2018",
                "revenue": {
                     "raw": 50365000,
                     "fmt": "50.37M",
                     "longFmt": "50,365,000"
                },
                "earnings": {
                     "raw": 2649000,
                     "fmt": "2.65M",
                     "longFmt": "2,649,000"
                }
           }
      ]
 },
 "financialCurrency": "USD"}

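As you can see, the JSON is nested, with some metadata at the top level of the dictionary, and that part is easy to read with something like pandas.io.json.json_normalize.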

df = pd.io.json.json_normalize(earnings_dict)

df
Out[13]: 
  earningsChart.currentQuarterEstimate.fmt  ...                             financialsChart.yearly
0                                     0.15  ...  [{'date': 2015, 'revenue': {'raw': 74977000, '...

[1 rows x 9 columns]

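However, it misses the nested lists of dictionaries that hold the yearly and quarterly earnings data. For example, the quarterly and yearly lists are simply added to the DataFrame as lists of dictionaries.

I think this was originally several SQL tables with foreign keys.

I have read the json_normalize documentation, but I can't figure out how to parse the dictionaries using the record_path and meta parameters.

I imagine I can use json_normalize to create DataFrames even from dictionaries nested several levels deep. It looks like I need at least 5: one for the metadata and 4 for the nested quarterly and yearly tables.

Bonus:

How would you store this? Would you put it in a NoSQL database or keep it in SQL? My requirement is fairly low-load, lightweight analysis that will need some views and charts built with pandas and matplotlib.

Thanks for your help!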

Tags: python, json, pandas

Solution
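One possible approach, offered as a sketch rather than an accepted answer: call pd.json_normalize (the top-level name available since pandas 1.0) once per nested list, pointing record_path at that list and carrying financialCurrency along through meta. The table names, the strip_lists helper, and the trimmed two-record copy of earnings_dict below are illustrative assumptions, not part of the original export or of pandas.

```python
import pandas as pd

# Trimmed copy of the question's earnings_dict (two records per list, fewer
# formatted fields) so that this sketch runs on its own.
earnings_dict = {
    "earningsChart": {
        "quarterly": [
            {"date": "1Q2018", "actual": {"raw": 0.12, "fmt": "0.12"},
             "estimate": {"raw": 0.05, "fmt": "0.05"}},
            {"date": "2Q2018", "actual": {"raw": 0.21, "fmt": "0.21"},
             "estimate": {"raw": 0.19, "fmt": "0.19"}},
        ],
        "currentQuarterEstimate": {"raw": 0.15, "fmt": "0.15"},
        "currentQuarterEstimateDate": "1Q",
        "currentQuarterEstimateYear": 2019,
        "earningsDate": [
            {"raw": 1556496000, "fmt": "2019-04-29"},
            {"raw": 1556841600, "fmt": "2019-05-03"},
        ],
    },
    "financialsChart": {
        "yearly": [
            {"date": 2017, "revenue": {"raw": 143803000, "fmt": "143.8M"},
             "earnings": {"raw": 9716000, "fmt": "9.72M"}},
            {"date": 2018, "revenue": {"raw": 190071000, "fmt": "190.07M"},
             "earnings": {"raw": 19967000, "fmt": "19.97M"}},
        ],
        "quarterly": [
            {"date": "1Q2018", "revenue": {"raw": 42340000, "fmt": "42.34M"},
             "earnings": {"raw": 4320000, "fmt": "4.32M"}},
            {"date": "2Q2018", "revenue": {"raw": 47240000, "fmt": "47.24M"},
             "earnings": {"raw": 7474000, "fmt": "7.47M"}},
        ],
    },
    "financialCurrency": "USD",
}

# Hypothetical table names mapped to the key path of each nested list.
record_paths = {
    "earnings_quarterly": ["earningsChart", "quarterly"],
    "earnings_dates": ["earningsChart", "earningsDate"],
    "financials_yearly": ["financialsChart", "yearly"],
    "financials_quarterly": ["financialsChart", "quarterly"],
}

# One DataFrame per nested list; meta repeats the top-level currency on every row.
tables = {
    name: pd.json_normalize(earnings_dict, record_path=path, meta=["financialCurrency"])
    for name, path in record_paths.items()
}


def strip_lists(node):
    """Return a copy of a nested dict with every list-valued key removed."""
    return {
        key: strip_lists(value) if isinstance(value, dict) else value
        for key, value in node.items()
        if not isinstance(value, list)
    }


# Fifth table: the scalar metadata that remains once the lists are removed.
metadata = pd.json_normalize(strip_lists(earnings_dict))

for name, frame in tables.items():
    print(name, list(frame.columns))
print("metadata", list(metadata.columns))
```

Each frame in tables has one row per record, and the nested raw/fmt dicts come out as dotted columns such as actual.raw. If the tables need to be joined later, more keys can be listed in meta.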

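On the bonus question, the right store depends on the workload, so treat this as one option and not a recommendation. For low-load, lightweight analysis, a single SQLite file read and written through pandas may be enough. The sketch below assumes a flattened yearly table like the one the question describes. The table name financials_yearly, the underscore column renaming, and the in-memory database are illustrative choices.

```python
import sqlite3

import pandas as pd

# Yearly records copied from the question, trimmed to the raw and fmt fields.
yearly_records = [
    {"date": 2015, "revenue": {"raw": 74977000, "fmt": "74.98M"},
     "earnings": {"raw": -15668000, "fmt": "-15.67M"}},
    {"date": 2016, "revenue": {"raw": 105586000, "fmt": "105.59M"},
     "earnings": {"raw": -8281000, "fmt": "-8.28M"}},
    {"date": 2017, "revenue": {"raw": 143803000, "fmt": "143.8M"},
     "earnings": {"raw": 9716000, "fmt": "9.72M"}},
    {"date": 2018, "revenue": {"raw": 190071000, "fmt": "190.07M"},
     "earnings": {"raw": 19967000, "fmt": "19.97M"}},
]

# Flatten the nested dicts into dotted columns, then swap the dots for
# underscores so the names are plain SQL identifiers.
financials_yearly = pd.json_normalize(yearly_records)
financials_yearly.columns = financials_yearly.columns.str.replace(".", "_", regex=False)

# ":memory:" keeps the sketch self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
try:
    financials_yearly.to_sql("financials_yearly", conn, index=False, if_exists="replace")
    # Read back only the profitable years as a small analysis view.
    profitable = pd.read_sql_query(
        "SELECT date, revenue_raw, earnings_raw "
        "FROM financials_yearly WHERE earnings_raw > 0 ORDER BY date",
        conn,
    )
finally:
    conn.close()

print(profitable)
```

The frame returned by read_sql_query can be passed straight to DataFrame.plot for matplotlib charts.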
