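json - Convert a Pandas DataFrame to multi-level nested JSON
Problem description
I'm working with a DataFrame (imported from a .csv) that I want to convert to nested JSON, but I can't manage to create an additional nesting level. I'll try to explain with an example. At the end of the job, the result is imported into MongoDB using pymongo.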
----------------------------------------------------
worker_id | gender | employer_id | year | job_type |
----------------------------------------------------
WORK_1 | M | EMPL_2 | 1990 | Att |
----------------------------------------------------
WORK_1 | M | EMPL_1 | 1991 | Mis |
----------------------------------------------------
WORK_1 | M | EMPL_1 | 1993 | Att |
----------------------------------------------------
WORK_2 | F | EMPL_3 | 1995 | Att |
----------------------------------------------------
WORK_2 | F | EMPL_3 | 1992 | Mis |
----------------------------------------------------
WORK_2 | F | EMPL_3 | 1994 | Att |
----------------------------------------------------
import pandas as pd

df = pd.DataFrame({'worker_id': ['WORK_1', 'WORK_1', 'WORK_1', 'WORK_2', 'WORK_2', 'WORK_2'],
                   'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
                   'employer_id': ['EMPL_2', 'EMPL_1', 'EMPL_1', 'EMPL_3', 'EMPL_3', 'EMPL_3'],
                   'year': [1990, 1991, 1993, 1995, 1992, 1994],
                   'job_type': ['Att', 'Mis', 'Att', 'Att', 'Mis', 'Att']
                   })
The JSON I want to obtain should follow a structure like this:
{ "worker_id": "WORK_1",
  "gender": "M",
  "job_type": [
    { "Att": [
        {
          "employer_id": "EMPL_1",
          "year": 1990
        },
        {
          "employer_id": "EMPL_2",
          "year": 1993
        }
      ]
    },
    { "Mis": [
        {
          "employer_id": "EMPL_1",
          "year": 1991
        }
      ]
    }
  ]
},
{ "worker_id": "WORK_2",
  "gender": "F",
  "job_type": [
    { "Att": [
        {
          "employer_id": "EMPL_3",
          "year": 1994
        },
        {
          "employer_id": "EMPL_3",
          "year": 1995
        }
      ]
    },
    { "Mis": [
        {
          "employer_id": "EMPL_3",
          "year": 1992
        }
      ]
    }
  ]
}
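Thanks to some helpful discussions on Stack Overflow, I was able to nest one object per job contract inside the "job_type" array (each row of the sample represents one specific job contract, followed by several other variables). However, I also want to distinguish between the kinds of job ("Mis" and "Att" in the example), which means creating another nesting level.
The code I use to nest the job contracts under each worker is as follows.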
finalList = []
grouped = df.groupby(['worker_id', 'gender'])

for key, value in grouped:
    # One dictionary per (worker_id, gender) group
    dictionary = {}
    j = value.reset_index(drop=True)
    dictionary['worker_id'] = j.at[0, 'worker_id']
    dictionary['gender'] = j.at[0, 'gender']

    # One nested object per contract (row) of this worker
    dictList = []
    anotherDict = {}
    for i in j.index:
        anotherDict['employer_id'] = j.at[i, 'employer_id']
        anotherDict['year'] = int(j.at[i, 'year'])  # plain int instead of numpy.int64
        anotherDict['job_type'] = j.at[i, 'job_type']
        dictList.append(anotherDict.copy())

    dictionary['job_type'] = dictList
    finalList.append(dictionary)
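I hope someone can help me. Thanks in advance!
Update
I tried to improve the code with the script below (I followed this thread). Unfortunately, I still don't get what I want.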
import json

# Generate a flag column for each kind of 'job_type'
df['att'] = ['Att' if x == 'Att' else None for x in df['job_type']]
df['mis'] = ['Mis' if x == 'Mis' else None for x in df['job_type']]

# Aggregate the rows with 'job_type' == 'Att'
df_att = df.dropna(subset=['att']).drop(columns=['mis'])
att = (df_att.groupby(['worker_id', 'gender'])
             .apply(lambda x: x[['employer_id', 'year', 'job_type']].to_dict('records'))
             .reset_index()
             .rename(columns={0: 'Att'}))

# Aggregate the rows with 'job_type' == 'Mis'
df_mis = df.dropna(subset=['mis']).drop(columns=['att'])
mis = (df_mis.groupby(['worker_id', 'gender'])
             .apply(lambda x: x[['employer_id', 'year', 'job_type']].to_dict('records'))
             .reset_index()
             .rename(columns={0: 'Mis'}))

# Append (DataFrame.append was removed in pandas 2.0, so pd.concat is used here)
df_all = pd.concat([att, mis])

# Aggregate by 'worker_id' and 'gender'
j = (df_all.groupby(['worker_id', 'gender'])
           .apply(lambda x: x[['Att', 'Mis']].to_dict('records'))
           .reset_index()
           .rename(columns={0: 'job_type'})
           .to_json(orient='records'))

print(json.dumps(json.loads(j), indent=4, sort_keys=True))
What I get so far:
[
    {
        "gender": "M",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "job_type": "Att",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Att",
                        "year": 1993
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Mis",
                        "year": 1991
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1994
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Mis",
                        "year": 1992
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]
Solution
Here is a solution that loops over the unique worker_id values and builds a list of dictionaries for each worker_id:
import pandas as pd
import json

df = pd.DataFrame({'worker_id': ['WORK_1', 'WORK_1', 'WORK_1', 'WORK_2', 'WORK_2', 'WORK_2'],
                   'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
                   'employer_id': ['EMPL_2', 'EMPL_1', 'EMPL_1', 'EMPL_3', 'EMPL_3', 'EMPL_3'],
                   'year': [1990, 1991, 1993, 1995, 1992, 1994],
                   'job_type': ['Att', 'Mis', 'Att', 'Att', 'Mis', 'Att']})

# One row per unique (worker_id, gender) pair
df_G = df[['worker_id', 'gender']].drop_duplicates()

all_dicts = []
for indx, vals in df_G.iterrows():
    this_dict = vals.to_dict()
    # Map each job_type of this worker to its list of {employer_id, year} records
    # ('records' is the full spelling of the old 'r' shorthand)
    job_dict = (df[df.worker_id == vals['worker_id']]
                .groupby(['job_type'])
                .apply(lambda x: x[['employer_id', 'year']].to_dict('records'))
                .to_dict())
    # Wrap each job_type in its own single-key dictionary
    this_dict['job_type'] = []
    for key, val in job_dict.items():
        this_dict['job_type'].append({key: val})
    all_dicts.append(this_dict)

print(json.dumps(all_dicts, indent=4, sort_keys=True))
This prints (the order of the entries inside "job_type" may differ depending on your Python version):
[
    {
        "gender": "M",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "year": 1991
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "year": 1993
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1992
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "year": 1994
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]
It may not be the most efficient or Pythonic approach, but it works. If I remember pymongo correctly, you can pass it a list of dictionaries to insert.
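For comparison, the same nesting can also be sketched with two nested groupby loops instead of filtering the frame once per worker. This is only an illustrative variant, not the method from the answer above: the helper name `nest_workers` is made up for this example, and the explicit `int(...)` conversion is an assumption that plain Python integers are wanted in the result.

```python
import json
import pandas as pd

df = pd.DataFrame({
    'worker_id': ['WORK_1', 'WORK_1', 'WORK_1', 'WORK_2', 'WORK_2', 'WORK_2'],
    'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
    'employer_id': ['EMPL_2', 'EMPL_1', 'EMPL_1', 'EMPL_3', 'EMPL_3', 'EMPL_3'],
    'year': [1990, 1991, 1993, 1995, 1992, 1994],
    'job_type': ['Att', 'Mis', 'Att', 'Att', 'Mis', 'Att'],
})


def nest_workers(frame):
    """Hypothetical helper: one dict per worker, contracts grouped by job_type."""
    records = []
    # Outer groupby: one group per (worker_id, gender) pair
    for (worker_id, gender), worker_rows in frame.groupby(['worker_id', 'gender']):
        job_types = []
        # Inner groupby: one group per job_type within this worker
        for job_type, contracts in worker_rows.groupby('job_type'):
            job_types.append({
                job_type: [
                    # int() turns a possible numpy integer into a plain Python int
                    {'employer_id': row.employer_id, 'year': int(row.year)}
                    for row in contracts.itertuples(index=False)
                ]
            })
        records.append({'worker_id': worker_id,
                        'gender': gender,
                        'job_type': job_types})
    return records


all_dicts = nest_workers(df)
print(json.dumps(all_dicts, indent=4))
```

Because groupby sorts its keys by default, "Att" comes before "Mis" for every worker here. If `collection` is a pymongo collection object, something like `collection.insert_many(all_dicts)` should then store one document per worker; that call is not exercised in this sketch.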