Building a pandas DataFrame from a nested structure in Python

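Problem description

I want to apply machine learning to a dataset that is somewhat too complex for me to handle directly. I would like to work with pandas and then use some of the built-in models in scikit-learn.

The data comes in a JSON file; a sample is shown below: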

{
  "demo_Profile": {
    "sex": "male",
    "age": 98,
    "height": 160,
    "weight": 139,
    "bmi": 5,
    "someinfo1": [
      "some_more_info1"
    ],
    "someinfo2": [
      "some_more_inf2"
    ],
    "someinfo3": [
      "some_more_info3"
    ]
  },
  "event": {
    "info_personal": {
      "info1": 219.59,
      "info2": 129.18,
      "info3": 41.15,
      "info4": 94.19
    },
    "symptoms": [
      {
        "name": "name1",
        "socrates": {
          "associations": [
            "associations1"
          ],
          "onsetType": "onsetType1",
          "timeCourse": "timeCourse1"
        }
      },
      {
        "name": "name2",
        "socrates": {
          "timeCourse": "timeCourse2"
        }
      },
      {
        "name": "name3",
        "socrates": {
          "onsetType": "onsetType2"
        }
      },
      {
        "name": "name4",
        "socrates": {
          "onsetType": "onsetType3"
        }
      },
      {
        "name": "name5",
        "socrates": {
          "associations": [
            "associations2"
          ]
        }
      }
    ],
    "labs": [
      {
        "name": "name1 ",
        "value": "valuelab"
      }
    ]
  }
}

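I want to create a pandas DataFrame that takes this nested data into account, but I don't know how to build one that handles nested parameters in addition to single-valued ones.

For example, I don't know how to combine "demo_Profile", which contains single-valued parameters, with "symptoms", which is a list of dictionaries whose entries are single values in some cases and lists in others.

Does anyone know a way to handle this?

EDIT

The JSON shown above is only an example. In other cases, the number of values in the lists and the number of symptoms will differ, so the structure shown is not fixed for every case.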

Tags: python, json, pandas, nested, structure

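Solution


Consider pandas' json_normalize. However, because some of the data is nested more deeply, normalize each section separately, concatenate the pieces side by side, and then forward-fill the columns that hold a single value per record.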
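To illustrate why the sections need separate handling, here is a minimal sketch of what json_normalize does with one symptom-like entry. The `nested` dict is a trimmed-down, illustrative copy of one item from the question's data, and the sketch assumes pandas 1.0 or later, where the function is available as `pd.json_normalize`.

```python
import pandas as pd

# Illustrative record modeled on one "symptoms" entry from the question
nested = {
    "name": "name1",
    "socrates": {
        "onsetType": "onsetType1",
        "associations": ["associations1"],
    },
}

# Nested dicts become dotted column names; lists are kept as list objects
flat = pd.json_normalize(nested)

print(sorted(flat.columns))
print(flat.loc[0, "socrates.associations"])
```

Nested dictionaries are expanded into dotted columns such as `socrates.onsetType`, while list values stay as Python lists inside a single cell, which is why the list columns still need to be flattened afterwards.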

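import json
import pandas as pd

# Load the nested JSON file into a Python dict
with open('myfile.json', 'r') as f:
    data = json.load(f)

# Normalize each section separately, then place the pieces side by side.
# pd.json_normalize is available since pandas 1.0; the older
# pandas.io.json import path is deprecated.
final_df = pd.concat([pd.json_normalize(data['demo_Profile']),
                      pd.json_normalize(data['event']['symptoms']),
                      pd.json_normalize(data['event']['info_personal']),
                      pd.json_normalize(data['event']['labs'])], axis=1)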

# FLATTEN NESTED LISTS (keep the first element; leave NaN and scalars untouched)
n_list = ['someinfo1', 'someinfo2', 'someinfo3', 'socrates.associations']

final_df[n_list] = final_df[n_list].apply(lambda col:
                     col.apply(lambda x: x[0] if isinstance(x, list) else x))

# FILLING FORWARD
norm_list = ['age', 'bmi', 'height', 'weight', 'sex', 'someinfo1', 'someinfo2', 'someinfo3', 
             'info1', 'info2', 'info3', 'info4', 'name', 'value']

final_df[norm_list] = final_df[norm_list].ffill()  
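The flattening step above keeps only the first element of each list. Since the question notes that the number of values in a list can vary, one option is to join all elements into a single delimited string instead. This is only a sketch: the `join_list_cell` helper, the `|` separator, and the small `demo` frame are hypothetical and not part of the original answer.

```python
import pandas as pd

def join_list_cell(cell, sep="|"):
    """Join a list cell into one string; return NaN and scalars unchanged."""
    if isinstance(cell, list):
        return sep.join(str(item) for item in cell)
    return cell

# Hypothetical column with lists of different lengths and a missing value
demo = pd.DataFrame({
    "socrates.associations": [["a1", "a2"], float("nan"), ["a3"]],
})

demo["socrates.associations"] = demo["socrates.associations"].map(join_list_cell)
print(demo)
```

A column in this delimited form could then be expanded into indicator columns with `Series.str.get_dummies(sep="|")` if a model needs numeric features.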

Output

print(final_df)

#     age  bmi  height   sex        someinfo1       someinfo2        someinfo3  weight   name socrates.associations socrates.onsetType socrates.timeCourse   info1   info2  info3  info4    name     value
# 0  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name1         associations1         onsetType1         timeCourse1  219.59  129.18  41.15  94.19  name1   valuelab
# 1  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name2                   NaN                NaN         timeCourse2  219.59  129.18  41.15  94.19  name1   valuelab
# 2  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name3                   NaN         onsetType2                 NaN  219.59  129.18  41.15  94.19  name1   valuelab
# 3  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name4                   NaN         onsetType3                 NaN  219.59  129.18  41.15  94.19  name1   valuelab
# 4  98.0  5.0   160.0  male  some_more_info1  some_more_inf2  some_more_info3   139.0  name5         associations2                NaN                 NaN  219.59  129.18  41.15  94.19  name1   valuelab
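As a possible alternative to concatenating and forward-filling, the single-row pieces can be cross-joined onto the symptom rows, which avoids maintaining a list of columns to fill. This is a minimal sketch rather than the answer's method. It assumes pandas 1.2 or later (for `how="cross"`) and exactly one entry in `labs`, since a cross join would otherwise multiply the rows. The `build_frame` helper, the `lab_` prefix (used to keep the two `name` columns apart), and the trimmed `sample` record are hypothetical.

```python
import pandas as pd

def build_frame(record):
    """Return one row per symptom, with record-level fields repeated."""
    profile = pd.json_normalize(record["demo_Profile"])
    personal = pd.json_normalize(record["event"]["info_personal"])
    symptoms = pd.json_normalize(record["event"]["symptoms"])
    # Prefix lab columns so "name" does not clash with the symptom name
    labs = pd.json_normalize(record["event"]["labs"]).add_prefix("lab_")

    # A cross join repeats each single-row frame for every symptom row
    return (profile.merge(personal, how="cross")
                   .merge(symptoms, how="cross")
                   .merge(labs, how="cross"))

# Trimmed-down record modeled on the question's sample
sample = {
    "demo_Profile": {"sex": "male", "age": 98},
    "event": {
        "info_personal": {"info1": 219.59},
        "symptoms": [
            {"name": "name1", "socrates": {"timeCourse": "timeCourse1"}},
            {"name": "name2", "socrates": {"onsetType": "onsetType2"}},
        ],
        "labs": [{"name": "name1 ", "value": "valuelab"}],
    },
}

df = build_frame(sample)
print(df)
```

With several JSON files, the frames returned for each record could be stacked with `pd.concat(frames, ignore_index=True)`; symptoms that lack a given `socrates` key simply end up as NaN in that column.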
