Converting a Pandas DataFrame to multi-level nested JSON

Problem description

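I am working on a DataFrame (imported from a .csv file) that I want to convert to nested JSON, but I cannot create the additional nesting level. I will try to explain with an example. At the end of the job, the result is imported into MongoDB using pymongo.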

----------------------------------------------------
worker_id | gender | employer_id | year | job_type |
----------------------------------------------------
WORK_1    | M      | EMPL_2      | 1990 | Att      |
WORK_1    | M      | EMPL_1      | 1991 | Mis      |
WORK_1    | M      | EMPL_1      | 1993 | Att      |
WORK_2    | F      | EMPL_3      | 1995 | Att      |
WORK_2    | F      | EMPL_3      | 1992 | Mis      |
WORK_2    | F      | EMPL_3      | 1994 | Att      |
----------------------------------------------------

import pandas as pd

df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
                    'gender':['M','M','M','F','F','F'],
                    'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
                    'year':[1990,1991,1993,1995,1992,1994],
                    'job_type':['Att','Mis','Att','Att','Mis','Att']
                    })

The JSON I want to obtain should follow this structure:

{ "worker_id": "WORK_1",
  "gender": "M",
  "job_type" : [
               { "Att": [
                        { 
                          "employer_id": "EMPL_2",
                          "year": 1990
                        },
                        { 
                          "employer_id": "EMPL_1",
                          "year": 1993
                        }
                        ]      
               },      
               { "Mis": [
                        { 
                          "employer_id": "EMPL_1",
                          "year": 1991
                        }
                        ]      
               } 
               ]
},
{ "worker_id": "WORK_2",
  "gender": "F",
  "job_type" : [
               { "Att": [
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1994
                        },
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1995
                        }
                        ]      
               },      
               { "Mis": [
                        { 
                          "employer_id": "EMPL_3",
                          "year": 1992
                        }
                        ]      
               } 
               ]
}

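Thanks to some helpful discussions on Stack Overflow, I was able to nest one object per work contract inside the "job_type" array (each row of the sample represents one specific work contract, followed by several other variables). However, I would also like to distinguish the kind of job ("Mis" versus "Att" in the example) and so create another nesting level.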

The code I used to nest the work contracts under each worker is the following.

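finalList = []

# Group the rows by worker; every group holds the contracts of one worker
grouped = df.groupby(['worker_id',
                      'gender'
                     ])
for key, value in grouped:

    dictionary = {}

    # 'value' is already the sub-DataFrame of the current group
    j = value.reset_index(drop = True)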
    dictionary['worker_id'] = j.at[0, 'worker_id']
    dictionary['gender'] = j.at[0, 'gender']

    dictList = []
    anotherDict = {}

    for i in j.index:

        anotherDict['employer_id'] = j.at[i, 'employer_id']       
        anotherDict['year'] = j.at[i, 'year']
        anotherDict['job_type'] = j.at[i, 'job_type']

        dictList.append(anotherDict.copy())

    dictionary['job_type'] = dictList

    finalList.append(dictionary)

I hope someone can help me. Thank you in advance!

Update

I tried to enhance the code with the script below (I followed this thread). Unfortunately, I still do not get what I want.

import json

# Generates a column for each kind of 'job_type'
df['att'] = ['Att' if x == 'Att' else None for x in df['job_type']]
df['mis'] = ['Mis' if x == 'Mis' else None for x in df['job_type']]

# Aggregate for the 'job_type' = 'Att'
df_att = df.dropna(subset = ['att']).drop(columns = ['mis'])
att = (df_att.groupby(['worker_id','gender'], as_index = True)
             .apply(lambda x: x[['employer_id','year','job_type']].to_dict('records'))
             .reset_index()
             .rename(columns = {0:'Att'}))

# Aggregate for the 'job_type' = 'Mis'
df_mis = df.dropna(subset = ['mis']).drop(columns = ['att'])
mis = (df_mis.groupby(['worker_id','gender'], as_index = True)
             .apply(lambda x: x[['employer_id','year','job_type']].to_dict('records'))
             .reset_index()
             .rename(columns = {0:'Mis'}))

# Append (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
df_all = pd.concat([att, mis])

# Aggregate for 'worker_id' and 'gender'
j = (df_all.groupby(['worker_id','gender'], as_index = True)
             .apply(lambda x: x[['Att','Mis']].to_dict('records'))
             .reset_index()
             .rename(columns = {0:'job_type'})
             .to_json(orient = 'records'))

print(json.dumps(json.loads(j), indent = 4, sort_keys = True))

What I have got so far:

[
    {
        "gender": "M",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "job_type": "Att",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Att",
                        "year": 1993
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "job_type": "Mis",
                        "year": 1991
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Att",
                        "year": 1994
                    }
                ],
                "Mis": null
            },
            {
                "Att": null,
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "job_type": "Mis",
                        "year": 1992
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]
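
The "Att": null and "Mis": null entries appear because each worker contributes two rows to df_all (one from att, one from mis), and each row carries a missing value in the other column. As a rough sketch of one possible workaround, the null keys could be stripped after the fact. The helper name drop_nulls and the trimmed j_text sample are made up for illustration; j_text stands in for the JSON string j built by the script.

```python
import json


def drop_nulls(workers):
    """Return a copy of the workers list without null-valued keys in 'job_type'."""
    cleaned = []
    for worker in workers:
        # Keep only the job types that actually hold a list of contracts
        entries = [
            {key: value for key, value in entry.items() if value is not None}
            for entry in worker["job_type"]
        ]
        cleaned.append({**worker, "job_type": entries})
    return cleaned


# Hypothetical, trimmed stand-in for the JSON string `j` from the script
j_text = """
[{"worker_id": "WORK_1", "gender": "M",
  "job_type": [
      {"Att": [{"employer_id": "EMPL_2", "job_type": "Att", "year": 1990}], "Mis": null},
      {"Att": null, "Mis": [{"employer_id": "EMPL_1", "job_type": "Mis", "year": 1991}]}
  ]}]
"""

result = drop_nulls(json.loads(j_text))
print(json.dumps(result, indent=4, sort_keys=True))
```

This only cleans up the output; it does not remove the redundant "job_type" key inside each contract.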

Tags: json, python-3.x, pandas, pandas-groupby, nested-loops

Solution


Here is a solution that loops over the unique worker_id values and builds a list of dictionaries for each worker_id:

import pandas as pd
import json
df = pd.DataFrame({'worker_id':['WORK_1','WORK_1','WORK_1','WORK_2','WORK_2','WORK_2'],
                    'gender':['M','M','M','F','F','F'],
                    'employer_id':['EMPL_2','EMPL_1','EMPL_1','EMPL_3','EMPL_3','EMPL_3'],
                    'year':[1990,1991,1993,1995,1992,1994],
                    'job_type':['Att','Mis','Att','Att','Mis','Att']})

# One row per worker, keeping only the worker-level fields
df_G=df[['worker_id','gender']].drop_duplicates()
all_dicts=[]
for indx,vals in df_G.iterrows():
    this_dict=vals.to_dict()
    # Map each job_type to the list of employer_id/year records of this worker
    job_dict=(df[df.worker_id==vals['worker_id']]
        .groupby('job_type').apply(lambda x: x[['employer_id','year']]
        .to_dict('records')).to_dict())
    this_dict['job_type']=[]
    for key,val in job_dict.items():
        this_dict['job_type'].append({key:val})
    all_dicts.append(this_dict)

print(json.dumps(all_dicts, indent = 4, sort_keys = True))

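This prints the following (on Python 3.7+, where dicts keep insertion order, the "Att" entry is listed before "Mis"):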

[
    {
        "gender": "M",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_1",
                        "year": 1991
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_2",
                        "year": 1990
                    },
                    {
                        "employer_id": "EMPL_1",
                        "year": 1993
                    }
                ]
            }
        ],
        "worker_id": "WORK_1"
    },
    {
        "gender": "F",
        "job_type": [
            {
                "Mis": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1992
                    }
                ]
            },
            {
                "Att": [
                    {
                        "employer_id": "EMPL_3",
                        "year": 1995
                    },
                    {
                        "employer_id": "EMPL_3",
                        "year": 1994
                    }
                ]
            }
        ],
        "worker_id": "WORK_2"
    }
]
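
As a possible variant of the loop above, the two grouping levels can be written as two nested groupby iterations, with the contracts sorted by year as in the structure requested in the question. This is only a sketch: the function name nest_workers is made up here, and the explicit int() conversion is an assumption to keep the values as plain Python types.

```python
import json

import pandas as pd

df = pd.DataFrame({'worker_id': ['WORK_1', 'WORK_1', 'WORK_1', 'WORK_2', 'WORK_2', 'WORK_2'],
                   'gender': ['M', 'M', 'M', 'F', 'F', 'F'],
                   'employer_id': ['EMPL_2', 'EMPL_1', 'EMPL_1', 'EMPL_3', 'EMPL_3', 'EMPL_3'],
                   'year': [1990, 1991, 1993, 1995, 1992, 1994],
                   'job_type': ['Att', 'Mis', 'Att', 'Att', 'Mis', 'Att']})


def nest_workers(frame):
    """Build one dict per worker, with contracts grouped by job_type."""
    workers = []
    # Outer level: one group per (worker_id, gender) pair
    for (worker_id, gender), worker_rows in frame.groupby(['worker_id', 'gender']):
        job_types = []
        # Inner level: one group per job_type; rows stay ordered by year
        for job_type, job_rows in worker_rows.sort_values('year').groupby('job_type'):
            contracts = [
                {'employer_id': str(employer), 'year': int(year)}
                for employer, year in zip(job_rows['employer_id'], job_rows['year'])
            ]
            job_types.append({job_type: contracts})
        workers.append({'worker_id': worker_id,
                        'gender': gender,
                        'job_type': job_types})
    return workers


all_dicts = nest_workers(df)
print(json.dumps(all_dicts, indent=4))
```

Iterating over the groups directly avoids filtering the full DataFrame once per worker, which may matter on larger inputs.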

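The loop-based solution above may not be the most efficient or Pythonic, but it works. If I remember pymongo correctly, you can pass it the list of dictionaries to insert.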
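
For the MongoDB step, a minimal sketch could look like the following. It relies on pymongo's Collection.insert_many, which accepts a list of dicts; the helper name insert_workers, the connection string, and the "mydb" and "workers" names are hypothetical. It also assumes the values are plain Python types, since pymongo cannot encode NumPy integers such as numpy.int64.

```python
import copy


def insert_workers(collection, docs):
    """Insert a list of dicts with insert_many and return how many were inserted."""
    # insert_many adds an "_id" key to each dict in place, so copies are
    # passed here to leave the original list untouched.
    result = collection.insert_many(copy.deepcopy(docs))
    return len(result.inserted_ids)


# With a running MongoDB server, usage might look like this
# (connection string, database and collection names are hypothetical):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   insert_workers(client["mydb"]["workers"], all_dicts)
```

The function accepts any object with an insert_many method, so it can be tried out with a stub before pointing it at a real collection.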

