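python - How to process a large JSON file (flatten it to TSV)
Question
I'm working with a large JSON file, specifically the Persona-Chat dataset (download here).
Each entry in Persona-Chat is a dictionary with two keys, personality and utterances, and the dataset is a list of entries.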
personality: list of strings containing the personality of the agent
utterances: list of dictionaries, each of which has two keys which are lists of strings.
    candidates: [next_utterance_candidate_1, ..., next_utterance_candidate_19] The last candidate is the ground truth response observed in the conversational data
    history: [dialog_turn_0, ... dialog_turn N], where N is an odd number since the other user starts every conversation.
https://towardsdatascience.com/how-to-train-your-chatbot-with-simple-transformers-da25160859f4
What I want to do is flatten it and convert it to a TSV in the following format:
col_index, string (where string is the personality, candidates and history)
But whenever I try to load it and convert it to a DataFrame
import pandas as pd
df = pd.read_json(r'path')
display(df)
I get the following error:
ValueError: arrays must all be same length
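Whether it's an article, another library/framework or approach, or even just a breadcrumb, any help would be greatly appreciated!
Edit: I'm feeding it to another API that requires TSV, and I'm looking for a way to concatenate the data while preserving the structure so that it can be rebuilt later.
Answer
To fully flatten the file, you need something like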
import json

def read_personachat_file(name="personachat_self_original.json"):
    # Load the whole file; the top level maps an entry type (e.g. 'train') to a list of chats.
    with open(name, "r") as f:
        data = json.load(f)
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            # Join the personality sentences into a single "|"-separated string.
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        # One flat row per phrase, carrying every index needed to locate it.
                        yield (entry_type, chat_id, personality, utt_id, key, phrase_id, phrase)

for entry in read_personachat_file():
    print(entry)
The output will look something like
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 7, 'my sister will be my mom , she wants me to get married')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 8, 'hi , how are ya ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 9, 'sounds good . i am just sitting here with my dog . i love animals .')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 10, "sure i'll go with you but i am baking a pizza right now , my favorite . come eat .")
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 11, 'where do you work then soccer person ?')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 12, 'it is so pretty in the fall and winter , my favorite time to go')
('train', 313, 'i like to wear red .|i wear a red purse .|i like to wear red shoes also .|i use red lipstick .|i drive a red car .', 5, 'candidates', 13, 'i to travel and meet new people')
(whether or not that is useful to you).
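Since the question asks for a TSV that can later be rebuilt, here is a minimal sketch of how the flat rows could be written with the standard csv module and regrouped afterwards. It is an illustration under stated assumptions, not a tested recipe for the real dataset: the names flatten_personachat, write_tsv, rebuild and HEADER are hypothetical, the sample dictionary is made up, no personality sentence may contain "|", and every chat has at least one personality sentence and one phrase.

```python
import csv
import io

# Hypothetical column names for the flat rows (not part of the dataset itself).
HEADER = ("entry_type", "chat_id", "personality", "utt_id", "key", "phrase_id", "phrase")

def flatten_personachat(data):
    """Yield one flat row per phrase, same layout as the generator above."""
    for entry_type, chats in data.items():
        for chat_id, chat in enumerate(chats):
            personality = "|".join(chat["personality"])
            for utt_id, utt in enumerate(chat["utterances"]):
                for key in ("candidates", "history"):
                    for phrase_id, phrase in enumerate(utt[key]):
                        yield (entry_type, chat_id, personality, utt_id, key, phrase_id, phrase)

def write_tsv(rows, out):
    """Write a header plus the rows as tab-separated values to a text stream."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)

def rebuild(rows):
    """Regroup flat rows into the nested structure.

    Assumes rows arrive in the order they were written and that "|" never
    occurs inside a personality sentence.
    """
    data = {}
    for entry_type, chat_id, personality, utt_id, key, _phrase_id, phrase in rows:
        chat_id, utt_id = int(chat_id), int(utt_id)  # TSV fields come back as strings
        chats = data.setdefault(entry_type, [])
        while len(chats) <= chat_id:
            chats.append({"personality": [], "utterances": []})
        chat = chats[chat_id]
        chat["personality"] = personality.split("|")
        utterances = chat["utterances"]
        while len(utterances) <= utt_id:
            utterances.append({"candidates": [], "history": []})
        utterances[utt_id][key].append(phrase)
    return data

# Tiny made-up sample with the same shape as the dataset.
sample = {
    "train": [{
        "personality": ["i like to wear red .", "i drive a red car ."],
        "utterances": [
            {"candidates": ["hi , how are ya ?", "i am fine ."],
             "history": ["hello !"]},
            {"candidates": ["nice .", "i love animals ."],
             "history": ["hello !", "i am fine .", "do you have pets ?"]},
        ],
    }],
    "valid": [{
        "personality": ["i like dogs ."],
        "utterances": [{"candidates": ["yes ."], "history": ["do you like dogs ?"]}],
    }],
}

# Write to an in-memory buffer; a real run would use open(path, "w", newline="").
buffer = io.StringIO()
write_tsv(flatten_personachat(sample), buffer)
tsv_text = buffer.getvalue()

# Read the TSV back, skip the header, and regroup the rows.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
next(reader)
restored = rebuild(reader)
```

Repeating the personality string on every row makes the file larger, but it keeps each row self-describing, which is what allows the regrouping step to work without a second lookup table.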