python - 将嵌套的 json 响应标准化为具有任意嵌套级别且键不一致的数据帧
问题描述
我正在努力将 JSON 响应转换为我想用于各种其他操作的 pandas 数据框。我已经尝试过这里列出的方法。但问题是我不能json_normalize
有效地使用,因为如果我将所需的键作为record_path
参数传递,它会抛出错误,因为只有某些字段有这个键而不是全部。我不想遍历整个 JSON 并一一比较键并重新创建我自己的字典对象。我想用uuid
和获取数据框nice_to_have_skills
,作为可以在 json 对象中找到这些属性的列nice_to_have_skills_path
和键。nice_to_have_experience
nice_to_have
nice_to_have
operands
示例 JSON 响应在这里
{ "hits": [ { "uuid": "00000000-0000-0000-0000-000000000000", "text_about": "some_text", "objectID": "00000000-0000-0000-0000-000000000000-text_about" }, { "uuid": "00000000-0000-0000-0000-000000000000", "search_space": "some json object", "is_searchspace": true, "objectID": "00000000-0000-0000-0000-000000000000-search_space" }, { "uuid": "00000000-0000-0000-0000-000000000000", "nice_to_have": [ { "operator": "AND", "operands": [ { "category": "Skill", "values": [ { "value": "MySQL ", "clusters": [] } ] } ] } ], "objectID": "00000000-0000-0000-0000-000000000000-nice_to_have" }, { "uuid": "00000000-0000-0000-0000-000000000000", "must_have": [ { "operator": "AND", "operands": [ { "category": "Skill", "values": [ { "value": "React ", "clusters": [] }, { "value": "Node.js ", "clusters": [] }, { "value": "WebSocket ", "clusters": [] }, { "value": "JavaScript ", "clusters": [] } ] }, { "category": "Experience", "values": [ { "value": "3 yrs", "clusters": [] } ] }, { "operator": "OR", "operands": [ { "category": "Location", "values": [ { "value": "Toronto, Ontario, Canada (No Visa Sponsorship)", "clusters": [] } ] }, { "category": "Remote", "values": [ { "value": "Anywhere in the world", "clusters": [] } ] } ] } ] } ], "objectID": "00000000-0000-0000-0000-000000000000-must_have" }, {"some key":"some json object"}, { "uuid": "10000000-0000-0000-0000-000000000001", "text_about": "some text", "objectID": "10000000-0000-0000-0000-000000000001-text_about" }, { "uuid": "10000000-0000-0000-0000-000000000001", "search_space": "some json object", "is_searchspace": true, "objectID": "10000000-0000-0000-0000-000000000001-search_space" }, { "uuid": "10000000-0000-0000-0000-000000000001", "nice_to_have": [ { "operator": "AND", "operands": [ { "category": "Skill", "values": [ { "value": "Frontend Programming Language ", "clusters": [ { "key": "Programming Language~>Frontend Programming Language", "name": "Frontend Programming Language", "path": [ "Programming Language", "Frontend Programming Language" ], "uuid": "e8c5cc6c-d92b-4098-8965-41e6818fe337", "category": "skill", "pretty_lineage": [ "Programming Language" ] } ] } ] } ] } ], "objectID": "10000000-0000-0000-0000-000000000001-nice_to_have" }, { "uuid": "10000000-0000-0000-0000-000000000001", "must_have": [ { "operator": "AND", "operands": [ { "category": "Remote", "values": [ { "value": "New Delhi (UTC+0530) 9am-6pm | 100.0% overlap", "clusters": [] } ] }, { "category": "Skill", "values": [ { "value": "User Research ", "clusters": [] }, { "value": "Wireframing / Prototyping ", "clusters": [ { "key": "Wireframing / Prototyping", "name": "Wireframing / Prototyping", "path": [ "Wireframing / Prototyping" ], "uuid": "39fb1722-f522-4c66-90c5-5ee1cc525ff4", "category": "skill", "pretty_lineage": [] } ] } ] }, { "category": "Experience", "values": [ { "value": "2 yrs", "clusters": [] } ] }, { "category": "Other", "values": [ { "value": "Immediate Joiners Only", "clusters": [] } ] } ] } ], "objectID": "10000000-0000-0000-0000-000000000001-must_have" }, {"some_key":"some json object"} ] }
我想"nice_to_have_skill" -> ["user research", "Wireframing / Prototyping"]
在我的数据框中像这样提取nice_to_have_skill
列名,["user research", "Wireframing / Prototyping"]
并将成为该列中的值。
编辑:如果 JSON 具有任意深度,如何处理它?例如
{"nice_to_have": [{"operator": "AND", "operands": [{"operator": "OR", "operands": [{"category": "Language", "values": [{" value": " Korean", "clusters": []}]}]}]}], "company_name": "Framework", "company_role": ["Manufacturing", "Supply Chain/Procurement"]} 是一部分JSON 并且可以有任何级别的嵌套。
解决方案
传递d['hits']
给json_normalize
结果:
d = json.loads(json_text)
In [136]: %time pd.json_normalize(d['hits'])
CPU times: user 2.1 ms, sys: 41 µs, total: 2.14 ms
Wall time: 2.12 ms
Out[136]:
uuid text_about objectID search_space is_searchspace nice_to_have must_have some key some_key
0 00000000-0000-0000-0000-000000000000 some_text 00000000-0000-0000-0000-000000000000-text_about NaN NaN NaN NaN NaN NaN
1 00000000-0000-0000-0000-000000000000 NaN 00000000-0000-0000-0000-000000000000-search_space some json object True NaN NaN NaN NaN
2 00000000-0000-0000-0000-000000000000 NaN 00000000-0000-0000-0000-000000000000-nice_to_have NaN NaN [{'operator': 'AND', 'operands': [{'category':... NaN NaN NaN
3 00000000-0000-0000-0000-000000000000 NaN 00000000-0000-0000-0000-000000000000-must_have NaN NaN NaN [{'operator': 'AND', 'operands': [{'category':... NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN some json object NaN
5 10000000-0000-0000-0000-000000000001 some text 10000000-0000-0000-0000-000000000001-text_about NaN NaN NaN NaN NaN NaN
6 10000000-0000-0000-0000-000000000001 NaN 10000000-0000-0000-0000-000000000001-search_space some json object True NaN NaN NaN NaN
7 10000000-0000-0000-0000-000000000001 NaN 10000000-0000-0000-0000-000000000001-nice_to_have NaN NaN [{'operator': 'AND', 'operands': [{'category':... NaN NaN NaN
8 10000000-0000-0000-0000-000000000001 NaN 10000000-0000-0000-0000-000000000001-must_have NaN NaN NaN [{'operator': 'AND', 'operands': [{'category':... NaN NaN
9 NaN NaN NaN NaN NaN NaN NaN NaN some json object
在那里你可以选择nice_to_have
:
df = pd.json_normalize(d, record_path=['hits'])
In [263]: %time df['nice_to_have'].dropna().sum()
CPU times: user 705 µs, sys: 11 µs, total: 716 µs
Wall time: 713 µs
Out[263]:
[{'operator': 'AND',
'operands': [{'category': 'Skill',
'values': [{'value': 'MySQL ', 'clusters': []}]}]},
{'operator': 'AND',
'operands': [{'category': 'Skill',
'values': [{'value': 'Frontend Programming Language ',
'clusters': [{'key': 'Programming Language~>Frontend Programming Language',
'name': 'Frontend Programming Language',
'path': ['Programming Language', 'Frontend Programming Language'],
'uuid': 'e8c5cc6c-d92b-4098-8965-41e6818fe337',
'category': 'skill',
'pretty_lineage': ['Programming Language']}]}]}]}]
希望这有帮助。
编辑:
回复您的评论:此 json 的主要问题是级别不一致,因此无法执行规范化并引发 KeyError。
获得的解决方法nice_to_have
:
f = list(filter(lambda x: 'nice_to_have' in x, d['hits']))
>> pd.json_normalize(f, ['nice_to_have', 'operands', 'values', 'clusters'])
key name path uuid category pretty_lineage
0 Programming Language~>Frontend Programming Lan... Frontend Programming Language [Programming Language, Frontend Programming La... e8c5cc6c-d92b-4098-8965-41e6818fe337 skill [Programming Language]
从那里你可以得到你想要得到的值。可以应用类似的解决方法来获取must_have
.
推荐阅读
- react-native - WebView 不一致地显示白屏 react-native
- terraform - Terraform:depends_on 用于计数资源
- winapi - CLisp/FFI 在 win32 中崩溃,可能是因为垃圾收集
- flutter - 删除默认的材料设计填充
- java - 对于 Ignite,如何在不覆盖 Marshaller 的情况下在不同的包中使用相同的类?
- python - 如何使用python在文本文件中查找文本模式?
- kubernetes - Kubernetes nginx 入口控制器终止连接
- reactjs - 当我刷新 react.js 和 next.js 页面时,它给了我 404 错误
- java - 具有相同迭代的两种不同冒泡排序方法的时间复杂度
- python - 如何让字符串函数检查单个字符而不是整个字符串?