Pandas: normalize a dict with multiple levels

Question

I have a list of dicts. Each dict contains nested dicts, and those nested dicts contain dicts as well. It's complicated, so I've put an example below:

One sample dict from the list of dicts

{
    "episodeStream": {
        "episodes": [
            {
                "timestamp": 1629908450,
                "data": {
                    "id": "001",
                    "version": "1.02",
                    "componentVersions": {
                        "xcomponent": "1.00",
                        "ycomponent": "1.20"
                    },
                    "adresses": []
                }
            },
            {
                "timestamp": 1629908932,
                "data": {
                    "links": [
                        {
                            "type": "type0",
                            "address": "ffeeddccbbaa",
                            "interfaceid": {
                                "xid": "123123",
                                "yid": "321321"
                            }
                        }
                    ],
                    "directLinkConnection": false
                }
            },
            ...
            ...
            ... more dicts where data field not necessarily same as above
            ...
            ...
        ],
        "deprecatedEpisodes": []
    },
    "header": {
        "namespace": "test",
        "message": "testmessage 123",
        "references": {}
    }
}

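This is one of the dicts in the list. The list holds many of them, [dict1, dict2, dict3, dict4, ...], and every dict has the same structure.

I want to load these into a Pandas DataFrame, but I'm not sure how a dict this complex becomes a DataFrame. If I just use pd.DataFrame.from_records, the columns are "episodeStream" and "header", and everything inside the "episodeStream" and "header" dicts ends up as their values.

So instead of pd.DataFrame.from_records, I tried pd.json_normalize with no other arguments. The columns are: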

episodeStream.episodes    episodeStream.deprecatedEpisodes    header.namespace    header.message    header.references

This is not what I want, because I also want the timestamp and data of each episode inside episodes as columns.

I also tried the record_path argument of pd.json_normalize, but that gave me KeyError: 'episodes':

records = [r for r in reader]        # records is a list of dicts
#df = pd.DataFrame.from_records(records)
df = pd.json_normalize(records, 'episodes')
return df
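
The KeyError is consistent with how record_path is resolved: the path starts at the top level of each record, where only episodeStream and header exist. A minimal sketch, assuming the structure shown above (the one-element records list here is a made-up sample), passes the full nested path as a list instead:

```python
import pandas as pd

# Hypothetical sample with the same shape as the records in the question.
records = [
    {
        "episodeStream": {
            "episodes": [
                {"timestamp": 1629908450, "data": {"id": "001", "version": "1.02"}},
                {"timestamp": 1629908932, "data": {"directLinkConnection": False}},
            ],
            "deprecatedEpisodes": [],
        },
        "header": {"namespace": "test", "message": "testmessage 123", "references": {}},
    }
]

# record_path is a path from the top of each record, so the parent key
# must be included; a bare 'episodes' is not found at the top level.
df = pd.json_normalize(records, record_path=["episodeStream", "episodes"])
print(df.columns.tolist())
```

This yields one row per episode with the nested data dict flattened into dotted column names, but it does not by itself add the two index columns asked for below.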

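I want a DataFrame like the one below, where episodeStreamId is the position of the dict in the list of dicts and episodesId is the position of the episode dict in that dict's episodes list: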

     episodeStreamId    episodesId    timestamp    data
0    0                  0             1629908450   "id": "001", "version": "1.02", "componentVersions": {"xcomponent": "1.00", "ycomponent": "1.20"}, "adresses": []
1    0                  1             1629908932   "links": [{"type": "type0", "address": "ffeeddccbbaa", "interfaceid": {"xid": "123123", "yid": "321321"}
     ...
     ...
     ...
120  1                  0             1629914424   "id": "002", "version": "1.02", "componentVersions": {"xcomponent": "1.00", "ycomponent": "1.20"}, "adresses": []
121  1                  1             1629914459   "links": [{"type": "type0", "address": "ddffaaccbbee", "interfaceid": {"xid": "444333", "yid": "222333"}
     ...
     ...
     ...

How do you convert a list of dicts like this into a Pandas DataFrame?

Edit: a diagram of my data:

[image]

Tags: python, pandas, dataframe, dictionary

Answer


Assuming records is your list of records, try:

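import json
import pandas

records = [...]  # the list of dicts described in the question

# One row per episode: the outer loop walks the list of dicts,
# the inner loop walks each dict's episodes list.
data = (
    {
        'episodeStreamId': esid,
        'episodesId': eid,
        'timestamp': episode['timestamp'],
        'data': json.dumps(episode['data'])  # keep the nested data as a JSON string
    }
    for esid, epistream in enumerate(records)
    for eid, episode in enumerate(epistream['episodeStream']['episodes'])
)

df = pandas.DataFrame(data)
print(df)

With the single sample dict from the question as records, this prints:

   episodeStreamId  episodesId   timestamp                                               data
0                0           0  1629908450  {"id": "001", "version": "1.02", "componentVer...
1                0           1  1629908932  {"links": [{"type": "type0", "address": "ffeed...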
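
If the data field should become columns rather than a JSON string, one possible variation (a sketch, not part of the original answer) builds the same per-episode rows and lets pd.json_normalize flatten the nested data dict. The helper name episodes_to_frame and the sample records are hypothetical.

```python
import pandas as pd

def episodes_to_frame(records):
    """Build one row per episode, tagged with its stream and episode index."""
    rows = [
        {
            "episodeStreamId": stream_id,
            "episodesId": episode_id,
            "timestamp": episode["timestamp"],
            "data": episode["data"],
        }
        for stream_id, stream in enumerate(records)
        for episode_id, episode in enumerate(stream["episodeStream"]["episodes"])
    ]
    # json_normalize expands the nested 'data' dict into dotted columns
    # such as 'data.id'; list values (e.g. 'links') stay as lists.
    return pd.json_normalize(rows)

# Made-up sample: two streams, with two and one episodes respectively.
records = [
    {
        "episodeStream": {
            "episodes": [
                {
                    "timestamp": 1629908450,
                    "data": {
                        "id": "001",
                        "componentVersions": {"xcomponent": "1.00"},
                    },
                },
                {
                    "timestamp": 1629908932,
                    "data": {"links": [{"type": "type0"}]},
                },
            ],
            "deprecatedEpisodes": [],
        },
    },
    {
        "episodeStream": {
            "episodes": [{"timestamp": 1629914424, "data": {"id": "002"}}],
            "deprecatedEpisodes": [],
        },
    },
]

df = episodes_to_frame(records)
print(df[["episodeStreamId", "episodesId", "timestamp"]])
```

Because the data dicts differ between episodes, keys that are missing from a given episode show up as NaN in that row.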
