首页 > 解决方案 > 将嵌套的 json 响应标准化为具有任意嵌套级别且键不一致的数据帧

问题描述

我正在努力将 JSON 响应转换为我想用于各种其他操作的 pandas 数据框。我已经尝试过这里列出的方法。但问题是我不能json_normalize有效地使用,因为如果我将所需的键作为record_path参数传递,它会抛出错误,因为只有某些字段有这个键而不是全部。我不想遍历整个 JSON 并一一比较键并重新创建我自己的字典对象。我想用uuid和获取数据框nice_to_have_skills,作为可以在 json 对象中找到这些属性的列nice_to_have_skills_path和键。nice_to_have_experiencenice_to_havenice_to_haveoperands

示例 JSON 响应在这里

{   "hits": [
{
  "uuid": "00000000-0000-0000-0000-000000000000",
  "text_about": "some_text",
  "objectID": "00000000-0000-0000-0000-000000000000-text_about"
},
{
  "uuid": "00000000-0000-0000-0000-000000000000",
  "search_space": "some json object",
  "is_searchspace": true,
  "objectID": "00000000-0000-0000-0000-000000000000-search_space"
},
{
  "uuid": "00000000-0000-0000-0000-000000000000",
  "nice_to_have": [
    {
      "operator": "AND",
      "operands": [
        {
          "category": "Skill",
          "values": [
            {
              "value": "MySQL ",
              "clusters": []
            }
          ]
        }
      ]
    }
  ],
  "objectID": "00000000-0000-0000-0000-000000000000-nice_to_have"
},
{
  "uuid": "00000000-0000-0000-0000-000000000000",
  "must_have": [
    {
      "operator": "AND",
      "operands": [
        {
          "category": "Skill",
          "values": [
            {
              "value": "React ",
              "clusters": []
            },
            {
              "value": "Node.js ",
              "clusters": []
            },
            {
              "value": "WebSocket ",
              "clusters": []
            },
            {
              "value": "JavaScript ",
              "clusters": []
            }
          ]
        },
        {
          "category": "Experience",
          "values": [
            {
              "value": "3 yrs",
              "clusters": []
            }
          ]
        },
        {
          "operator": "OR",
          "operands": [
            {
              "category": "Location",
              "values": [
                {
                  "value": "Toronto, Ontario, Canada (No Visa Sponsorship)",
                  "clusters": []
                }
              ]
            },
            {
              "category": "Remote",
              "values": [
                {
                  "value": "Anywhere in the world",
                  "clusters": []
                }
              ]
            }
          ]
        }
      ]
    }
  ],
  "objectID": "00000000-0000-0000-0000-000000000000-must_have"
},
{"some key":"some json object"},
{
  "uuid": "10000000-0000-0000-0000-000000000001",
  "text_about": "some text",
  "objectID": "10000000-0000-0000-0000-000000000001-text_about"
},
{
  "uuid": "10000000-0000-0000-0000-000000000001",
  "search_space": "some json object",
  "is_searchspace": true,
  "objectID": "10000000-0000-0000-0000-000000000001-search_space"
},
{
  "uuid": "10000000-0000-0000-0000-000000000001",
  "nice_to_have": [
    {
      "operator": "AND",
      "operands": [
        {
          "category": "Skill",
          "values": [
            {
              "value": "Frontend Programming Language ",
              "clusters": [
                {
                  "key": "Programming Language~>Frontend Programming Language",
                  "name": "Frontend Programming Language",
                  "path": [
                    "Programming Language",
                    "Frontend Programming Language"
                  ],
                  "uuid": "e8c5cc6c-d92b-4098-8965-41e6818fe337",
                  "category": "skill",
                  "pretty_lineage": [
                    "Programming Language"
                  ]
                }
              ]
            }
          ]
        }
      ]
    }
  ],
  "objectID": "10000000-0000-0000-0000-000000000001-nice_to_have"
},
{
  "uuid": "10000000-0000-0000-0000-000000000001",
  "must_have": [
    {
      "operator": "AND",
      "operands": [
        {
          "category": "Remote",
          "values": [
            {
              "value": "New Delhi (UTC+0530) 9am-6pm | 100.0% overlap",
              "clusters": []
            }
          ]
        },
        {
          "category": "Skill",
          "values": [
            {
              "value": "User Research ",
              "clusters": []
            },
            {
              "value": "Wireframing / Prototyping ",
              "clusters": [
                {
                  "key": "Wireframing / Prototyping",
                  "name": "Wireframing / Prototyping",
                  "path": [
                    "Wireframing / Prototyping"
                  ],
                  "uuid": "39fb1722-f522-4c66-90c5-5ee1cc525ff4",
                  "category": "skill",
                  "pretty_lineage": []
                }
              ]
            }
          ]
        },
        {
          "category": "Experience",
          "values": [
            {
              "value": "2 yrs",
              "clusters": []
            }
          ]
        },
        {
          "category": "Other",
          "values": [
            {
              "value": "Immediate Joiners Only",
              "clusters": []
            }
          ]
        }
      ]
    }
  ],
  "objectID": "10000000-0000-0000-0000-000000000001-must_have"
},
{"some_key":"some json object"}   ] }

我想"nice_to_have_skill" -> ["user research", "Wireframing / Prototyping"]在我的数据框中像这样提取nice_to_have_skill列名,["user research", "Wireframing / Prototyping"]并将成为该列中的值。

编辑:如果 JSON 具有任意深度,如何处理它?例如

{"nice_to_have": [{"operator": "AND", "operands": [{"operator": "OR", "operands": [{"category": "Language", "values": [{" value": " Korean", "clusters": []}]}]}]}], "company_name": "Framework", "company_role": ["Manufacturing", "Supply Chain/Procurement"]} 是一部分JSON 并且可以有任何级别的嵌套。

标签: pythonjsonpandas

解决方案


传递d['hits']json_normalize结果:

d = json.loads(json_text)

In [136]: %time pd.json_normalize(d['hits'])                                                                                                                                                                                                                                       
CPU times: user 2.1 ms, sys: 41 µs, total: 2.14 ms
Wall time: 2.12 ms
Out[136]: 
                                   uuid text_about                                           objectID      search_space is_searchspace                                       nice_to_have                                          must_have          some key          some_key
0  00000000-0000-0000-0000-000000000000  some_text    00000000-0000-0000-0000-000000000000-text_about               NaN            NaN                                                NaN                                                NaN               NaN               NaN
1  00000000-0000-0000-0000-000000000000        NaN  00000000-0000-0000-0000-000000000000-search_space  some json object           True                                                NaN                                                NaN               NaN               NaN
2  00000000-0000-0000-0000-000000000000        NaN  00000000-0000-0000-0000-000000000000-nice_to_have               NaN            NaN  [{'operator': 'AND', 'operands': [{'category':...                                                NaN               NaN               NaN
3  00000000-0000-0000-0000-000000000000        NaN     00000000-0000-0000-0000-000000000000-must_have               NaN            NaN                                                NaN  [{'operator': 'AND', 'operands': [{'category':...               NaN               NaN
4                                   NaN        NaN                                                NaN               NaN            NaN                                                NaN                                                NaN  some json object               NaN
5  10000000-0000-0000-0000-000000000001  some text    10000000-0000-0000-0000-000000000001-text_about               NaN            NaN                                                NaN                                                NaN               NaN               NaN
6  10000000-0000-0000-0000-000000000001        NaN  10000000-0000-0000-0000-000000000001-search_space  some json object           True                                                NaN                                                NaN               NaN               NaN
7  10000000-0000-0000-0000-000000000001        NaN  10000000-0000-0000-0000-000000000001-nice_to_have               NaN            NaN  [{'operator': 'AND', 'operands': [{'category':...                                                NaN               NaN               NaN
8  10000000-0000-0000-0000-000000000001        NaN     10000000-0000-0000-0000-000000000001-must_have               NaN            NaN                                                NaN  [{'operator': 'AND', 'operands': [{'category':...               NaN               NaN
9                                   NaN        NaN                                                NaN               NaN            NaN                                                NaN                                                NaN               NaN  some json object

在那里你可以选择nice_to_have

df = pd.json_normalize(d, record_path=['hits'])

In [263]: %time df['nice_to_have'].dropna().sum()                                                                                                                                                                                                                                  
CPU times: user 705 µs, sys: 11 µs, total: 716 µs
Wall time: 713 µs
Out[263]: 
[{'operator': 'AND',
  'operands': [{'category': 'Skill',
    'values': [{'value': 'MySQL ', 'clusters': []}]}]},
 {'operator': 'AND',
  'operands': [{'category': 'Skill',
    'values': [{'value': 'Frontend Programming Language ',
      'clusters': [{'key': 'Programming Language~>Frontend Programming Language',
        'name': 'Frontend Programming Language',
        'path': ['Programming Language', 'Frontend Programming Language'],
        'uuid': 'e8c5cc6c-d92b-4098-8965-41e6818fe337',
        'category': 'skill',
        'pretty_lineage': ['Programming Language']}]}]}]}]

希望这有帮助。

编辑:

回复您的评论:此 json 的主要问题是级别不一致,因此无法执行规范化并引发 KeyError。

获得的解决方法nice_to_have

f = list(filter(lambda x: 'nice_to_have' in x, d['hits']))  

>> pd.json_normalize(f, ['nice_to_have', 'operands', 'values', 'clusters'])

                                                 key                           name                                               path                                  uuid category          pretty_lineage
0  Programming Language~>Frontend Programming Lan...  Frontend Programming Language  [Programming Language, Frontend Programming La...  e8c5cc6c-d92b-4098-8965-41e6818fe337    skill  [Programming Language]

从那里你可以得到你想要得到的值。可以应用类似的解决方法来获取must_have.


推荐阅读