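python - Parsing a JSON Lines file
Problem description
I need to find a way to parse the data in a JSON file into CSV or XLSX. However, every online JSON validator I have tried gives me an error saying the JSON file is invalid.
A sample of the JSON file follows: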
{"id": "someID1.docx",
"language": {"detected": "cs"},
"title": "Name - Title - FileName",
"text": "Long string of text",
"entities": [
{"standardForm": "Svářečský průkaz", "type": "car"},
{"standardForm": "email1@gmail.com", "type": "email"},
{"standardForm": "english", "type": "languages"},
{"standardForm": "Práce na PC", "type": "abilities"},
{"standardForm": "MS Office", "type": "abilities"},
{"standardForm": "Automechanik", "type": "education"},
{"standardForm": "Střední průmyslová škola", "type": "education"},
{"standardForm": "Angličtina-Němčina", "type": "languages"},
{"standardForm": "mechanic", "type": "position"},
{"standardForm": "Praha", "type": "region"},
{"standardForm": "B2 - středně pokročilý", "type": "en_level"},
{"standardForm": "Skupina B", "type": "drivinglicense"}
]}
{"id": "someID2.pdf",
"language": {"detected": "cs"},
"title": "Name - Title - FileName2",
"text": "Long string of text2",
"entities": [
{"standardForm": "german", "type": "languages"},
{"standardForm": "high school", "type": "education"},
{"standardForm": "Angličtina-Němčina", "type": "languages"},
{"standardForm": "driver", "type": "position"},
{"standardForm": "english", "type": "languages"},
{"standardForm": "university", "type": "education"},
{"standardForm": "email2@seznam.cz", "type": "email"},
{"standardForm": "Středočeský", "type": "region"},
{"standardForm": "Střední", "type": "edulevel"},
{"standardForm": "manager", "type": "lastposition"},
{"standardForm": "? – nerozpoznáno", "type": "de_level"},
{"standardForm": "? – nerozpoznáno", "type": "en_level"},
{"standardForm": "Skupina C", "type": "drivinglicense"}
]}
...
I can load this JSON in Python:
import json
import pandas as pd
with open('jsonfile.json', 'r', encoding='utf-8') as f:
    jsonfile = [json.loads(line) for line in f]
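A note on the validator errors: a file that holds several top-level objects one after another is not a single JSON document, so a standard validator is expected to reject it. Loading line by line only works when each record occupies exactly one line. If the file is laid out as in the sample, with each record spanning several lines, one possible approach is `json.JSONDecoder.raw_decode`, which parses one value and reports where it ended. The following is a minimal sketch under that assumption; the helper name `iter_json_objects` and the shortened sample data are made up for illustration.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON value found back to back in `text`."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over
        # the blanks and newlines that separate the records.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Shortened, hypothetical stand-in for the file contents.
sample = '''{"id": "someID1.docx",
"entities": [
{"standardForm": "english", "type": "languages"}
]}
{"id": "someID2.pdf",
"entities": [
{"standardForm": "german", "type": "languages"}
]}
'''

records = list(iter_json_objects(sample))
```

With a real file, `text` would be the result of reading the whole file with `encoding='utf-8'`. Anything in the file that is not JSON will raise `json.JSONDecodeError`.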
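But I cannot find any way to convert it to CSV. I need to be able to store all the entities associated with each id, preferably in a CSV. Is there a way to do this? Do I need the JSON to be structured differently?
Thanks
Edit: for the sample above, I need the CSV output to look like this: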
ID;title;languages;education
someID1.docx;Name-Title-FileName;english,Angličtina-Němčina;Automechanik,Střední Prům. škola
someID2.pdf;Name-Title-FileName2; german,Angličtina-Němčina,english;high school, university
Solution
Since you have already imported pandas, you can use pandas.DataFrame:
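df = pd.DataFrame(jsonfile)
# Collect the standardForm of every entity whose type is 'languages'.
df['languages'] = df.apply(lambda x: [item['standardForm']
                                      for item in x.entities
                                      if item['type'] == 'languages'],
                           axis=1)
# Do the same for entities whose type is 'education'.
df['education'] = df.apply(lambda x: [item['standardForm']
                                      for item in x.entities
                                      if item['type'] == 'education'],
                           axis=1)
# Replace <filename> with the path of the output CSV file.
df.to_csv(<filename>, columns=['id', 'title', 'languages', 'education'])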
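The code above stores Python lists in the `languages` and `education` columns, so the CSV cells would contain list representations such as `['english', ...]`, and the default output is comma-separated with an extra index column. To get closer to the layout requested in the question, the values could be joined with commas and the file written with `;` as the separator. This is only a sketch: the helper `join_entities` and the shortened records are made up for illustration, and `to_csv` is called without a path here so that it returns the CSV text instead of writing a file.

```python
import pandas as pd

# Shortened, hypothetical records in the same shape as the question's data.
records = [
    {"id": "someID1.docx", "title": "Name - Title - FileName",
     "entities": [
         {"standardForm": "english", "type": "languages"},
         {"standardForm": "Automechanik", "type": "education"},
         {"standardForm": "Angličtina-Němčina", "type": "languages"},
     ]},
    {"id": "someID2.pdf", "title": "Name - Title - FileName2",
     "entities": [
         {"standardForm": "german", "type": "languages"},
         {"standardForm": "high school", "type": "education"},
     ]},
]


def join_entities(entities, wanted_type):
    """Join the standardForm values of one entity type with commas."""
    return ",".join(e["standardForm"] for e in entities
                    if e["type"] == wanted_type)


df = pd.DataFrame(records)
for col in ("languages", "education"):
    # Bind col as a default argument so each lambda keeps its own type.
    df[col] = df["entities"].apply(lambda ents, t=col: join_entities(ents, t))

# sep=";" matches the requested layout; index=False drops the row numbers.
csv_text = df.to_csv(sep=";", index=False,
                     columns=["id", "title", "languages", "education"])
print(csv_text)
```

Passing a file path as the first argument of `to_csv` would write the same content to disk. Because the separator is `;`, the commas inside a cell do not need quoting.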