Aggregating JSON by an element of a nested JSON object

Problem description

I have the following structure:

[
    {
        "Name": "a-1",
        "Tags": [
            {
                "Value": "a", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-02-25T17:33:19.000Z"
    },
    {
        "Name": "a-2",
        "Tags": [
            {
                "Value": "a", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-02-26T17:33:19.000Z"
    },
    {
        "Name": "b-1",
        "Tags": [
            {
                "Value": "b", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-01-21T17:33:19.000Z"
    },
    {
        "Name": "b-2",
        "Tags": [
            {
                "Value": "b", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-01-22T17:33:19.000Z"
    },
    {
        "Name": "c-1",
        "Tags": [
            {
                "Value": "c", 
                "Key": "Type"
            }
        ], 
        "CreationDate": "2018-08-29T17:33:19.000Z"
    }
]

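Items are grouped by the Value of their tag. When a group has more than one member, I want to print the Name of the oldest member of that group. This should be configurable, for example: when a group has more than y members, print the x oldest items. In this case there are two a, two b and one c, so the expected result would be: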

 a-1
 b-1

Here is my Python code so far:

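from itertools import chain, groupby
from operator import itemgetter

# ec2 is assumed to be a boto3 EC2 client, e.g. boto3.client('ec2')
data = ec2.describe_images(Owners=['11111'])
images = data['Images']
grouper = groupby(map(itemgetter('Tags'), images))
groups = (list(vals) for _, vals in grouper)
res = list(chain.from_iterable(filter(None, groups)))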

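Currently res only contains a list of Key and Value entries, and it is not grouped. Can anyone show me how to continue the code to reach the expected result?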

Tags: python, json

Solution


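Here is a solution using pandas. It expects a JSON string as input (json_string).

pandas is often overkill, but I think it fits well here, because you essentially want to group by Value and then eliminate some groups based on criteria such as how many members they have.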

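from io import StringIO

import pandas as pd

# load the dataframe from the json string; wrapping it in StringIO avoids the
# deprecation warning newer pandas versions raise for literal JSON strings
df = pd.read_json(StringIO(json_string))
df['CreationDate'] = pd.to_datetime(df['CreationDate'])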

# create a value column from the nested tags column
df['Value'] = df['Tags'].apply(lambda x: x[0]['Value'])

# groupby value and iterate through groups
groups = df.groupby('Value')
output = []
for name, group in groups:
    # skip groups with fewer than 2 members
    if group.shape[0] < 2:
        continue

    # sort rows by creation date
    group = group.sort_values('CreationDate')

    # save the oldest row (first after the ascending sort)
    oldest_from_group = group.iloc[0]
    output.append(oldest_from_group['Name'])

print(output)
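
If pandas is not available, the same grouping can probably be done with the standard library, continuing the itertools.groupby idea from the question. The following is only a minimal sketch: the helper names tag_value and oldest_names and the parameters min_members and keep are hypothetical, not part of the original answer. It assumes each image carries a tag whose Key is "Type", that "more than one member" means at least min_members members, and that all CreationDate strings share the same ISO 8601 format and time zone, so they can be compared as plain strings.

```python
from itertools import groupby


def tag_value(image, key="Type"):
    """Return the Value of the tag whose Key matches, or None if absent."""
    for tag in image.get("Tags", []):
        if tag["Key"] == key:
            return tag["Value"]
    return None


def oldest_names(images, min_members=2, keep=1, key="Type"):
    """Names of the `keep` oldest images in each group of at least `min_members`."""
    # Drop images without the tag, then sort by (tag value, creation date).
    # groupby only merges adjacent items, so sorting first is required, and
    # the date as second sort key makes every group come out oldest first.
    tagged = [img for img in images if tag_value(img, key) is not None]
    tagged.sort(key=lambda img: (tag_value(img, key), img["CreationDate"]))

    names = []
    for _, members in groupby(tagged, key=lambda img: tag_value(img, key)):
        members = list(members)
        if len(members) >= min_members:
            names.extend(img["Name"] for img in members[:keep])
    return names


# Sample data from the question, deliberately not in date order.
images = [
    {"Name": "a-2", "Tags": [{"Key": "Type", "Value": "a"}],
     "CreationDate": "2018-02-26T17:33:19.000Z"},
    {"Name": "a-1", "Tags": [{"Key": "Type", "Value": "a"}],
     "CreationDate": "2018-02-25T17:33:19.000Z"},
    {"Name": "b-1", "Tags": [{"Key": "Type", "Value": "b"}],
     "CreationDate": "2018-01-21T17:33:19.000Z"},
    {"Name": "b-2", "Tags": [{"Key": "Type", "Value": "b"}],
     "CreationDate": "2018-01-22T17:33:19.000Z"},
    {"Name": "c-1", "Tags": [{"Key": "Type", "Value": "c"}],
     "CreationDate": "2018-08-29T17:33:19.000Z"},
]

print(oldest_names(images))  # ['a-1', 'b-1']
```

Raising keep returns more of the oldest members per group, and raising min_members tightens which groups qualify. With real data the list passed in would be data['Images'] from the question.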
