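json - How do I access both parent and child JSON records when the parent is not the top-level element?
Problem description
I am trying to load the SQuAD dataset with Pandas. The JSON elements in my dataset are structured as follows, where everything ending in "s" represents a list: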
-data
-- title
-- paragraphs
--- context
--- qas
---- id
---- question
---- answers
----- answerStart
----- answerText
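I want to create a DataFrame that looks like this:
question    title    context    answerText
However, I want only one "answerText" value per question, that is, a single answer for each "qas" entry. Since the "qas" id is unique to each question/answer pair, it may be best to create an "answers" DataFrame and then another DataFrame like this: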
qas_id answer_id
However, I'm not sure how best to set up this schema. Here is what I have tried:
with open(filename) as file:
    data = json.load(file)["data"]
questions = pd.io.json.json_normalize(data, record_path=["paragraphs", "qas", "question"], meta=["paragraphs", "qas", "id"])
answers = pd.io.json.json_normalize(data, record_path=["paragraphs", "qas", "answers"], meta=["paragraphs", "qas", "id"])
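Since meta apparently only allows access to children of the top-level element, how can I create a DataFrame that contains the "id" element of each "qas" entry together with the "answerStart" and "answerText" elements of its answers?
Solution
I believe I have a working solution: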
import json
import pandas as pd

def readFile(filename):
    with open(filename) as file:
        data = json.load(file)["data"]
    # pd.json_normalize replaces the deprecated pd.io.json.json_normalize (pandas 1.0+).
    # One row per "qas" entry, with the article title carried along.
    qas = pd.json_normalize(data, record_path=["paragraphs", "qas"], meta=["title"])
    # Record the position of each question's first answer in the flattened
    # answers table built below. Assumes every question has at least one answer.
    # Haven't found a more efficient way to do this yet.
    answer_ids = []  # a list rather than a set, so the order matches the rows of qas
    answerId = 0
    for index, row in qas.iterrows():
        answer_ids.append(answerId)
        answerId = answerId + len(row["answers"])
    print("Finished with answer ids.")
    # Map each question to the ID of its first answer.
    answer_ids = pd.DataFrame(answer_ids)
    print("Finished converting answer_ids to DataFrame.")
    question_answerId = pd.DataFrame(qas["question"]).join(answer_ids, how="outer")
    question_answerId.columns = ["question", "answer_id"]
    print("Finished creating intermediary table.")
    # Load answers into a data frame, one row per answer.
    answers = pd.json_normalize(data, record_path=["paragraphs", "qas", "answers"])
    answers.rename(columns={"text": "answer_text"}, inplace=True)
    # Give each answer an ID equal to its row position.
    answers["id"] = answers.index
    print("Finished creating answers dataframe.")
    qas = qas.drop(labels=["answers"], axis=1)  # Not needed any longer; we have the answers!
    # Map the qas dataframe to the answers table via question_answerId.
    qas_answerId = pd.merge(qas, question_answerId, how="inner", on="question")
    # Questions with identical text produce duplicate rows in the merge; keep the first.
    qas_answerId = qas_answerId.drop_duplicates("question")
    assert not qas_answerId.duplicated("question").any()
    print("Finished joining qas to answer id")
    # Merge qas_answerId with answers. Both sides have an "id" column,
    # so pandas suffixes them as id_x (qas id) and id_y (answer id).
    returnDataFrame = pd.merge(qas_answerId, answers, how="inner", left_on="answer_id", right_on="id")
    print("Done!")
    return returnDataFrame
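
As an alternative sketch, pd.json_normalize (pandas 1.0 and later) also seems to accept nested paths in meta when each path is written as a list, so fields from intermediate levels, such as the "id" of each "qas" entry, can be attached to every answer in a single call. The sample record below is made up, it assumes the key names answer_start and text for the answer fields (matching the "text" column renamed in the code above), and the helper name flatten_squad is hypothetical.

```python
import pandas as pd

# Hand-made sample in the SQuAD layout (not real dataset content).
sample = [
    {
        "title": "Example article",
        "paragraphs": [
            {
                "context": "Pandas is a Python library.",
                "qas": [
                    {
                        "id": "q1",
                        "question": "What is Pandas?",
                        "answers": [
                            {"answer_start": 12, "text": "Python library"},
                            {"answer_start": 10, "text": "a Python library"},
                        ],
                    },
                    {
                        "id": "q2",
                        "question": "What language is Pandas for?",
                        "answers": [{"answer_start": 12, "text": "Python"}],
                    },
                ],
            }
        ],
    }
]


def flatten_squad(data):
    """Return one row per answer with parent fields attached (hypothetical helper)."""
    answers = pd.json_normalize(
        data,
        record_path=["paragraphs", "qas", "answers"],
        # Each nested meta path is a list of keys leading from the top-level record.
        meta=[
            "title",
            ["paragraphs", "context"],
            ["paragraphs", "qas", "id"],
            ["paragraphs", "qas", "question"],
        ],
    )
    # Nested meta columns are named by joining the path with ".".
    return answers.rename(
        columns={
            "text": "answer_text",
            "paragraphs.context": "context",
            "paragraphs.qas.id": "qas_id",
            "paragraphs.qas.question": "question",
        }
    )


all_answers = flatten_squad(sample)
# Keep only the first listed answer for each question.
one_answer = all_answers.drop_duplicates("qas_id").reset_index(drop=True)
print(one_answer[["question", "title", "context", "answer_text"]])
```

Deduplicating on qas_id rather than on the question text keeps questions that happen to share the same wording. A question whose answers list is empty contributes no rows to this table, so it would need separate handling if such entries matter.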