Grouping values by a column and a condition in pandas

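Problem description

I want to group a pandas DataFrame column based on whether its values fall within a range of +20 of each other. Here is the DataFrame: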

{'Name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'},
 'ID': {0: 100, 1: 23, 2: 19, 3: 42, 4: 11, 5: 78},
 'Left': {0: 70, 1: 70, 2: 70, 3: 70, 4: 66, 5: 66},
 'Top': {0: 10, 1: 26, 2: 26, 3: 35, 4: 60, 5: 71}}

Here, I want to group on the Left and Top columns. This is what I did:

df.groupby(['Top'],as_index=False).agg(lambda x: list(x))

This is the result I got:

{'Top': {0: 10, 1: 26, 2: 35, 3: 60, 4: 71},
 'Name': {0: ['A'], 1: ['B', 'C'], 2: ['D'], 3: ['E'], 4: ['F']},
 'ID': {0: [100], 1: [23, 19], 2: [42], 3: [11], 4: [78]},
 'Left': {0: [70], 1: [70, 70], 2: [70], 3: [66], 4: [66]}}

Desired output:

{'Top': {0: [10, 26, 26], 2: [35], 3: [60, 71]},
 'Name': {0: ['A', 'B', 'C'], 2: ['D'], 3: ['E', 'F']},
 'ID': {0: [100, 23, 19], 2: [42], 3: [11, 78]},
 'Left': {0: [70, 70, 70], 2: [70], 3: [66, 66]}}

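Note:

An important point: the Top values 10 and 26 are within 20 of each other, so they form a group. Even though the difference between 26 and 35 is also within 20, 35 should not be added to that group, because 10 and 26 are already in it and the difference between 10 (the smallest value in the group) and 35 is greater than 20.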
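To make the rule in the note concrete, the sketch below contrasts it with the common "gap to the previous value" idiom. The variable names (`chained`, `anchored`) are illustrative only, and the boundary is assumed to be inclusive (a value exactly 20 above the group's first value stays in the group).

```python
import pandas as pd

top = pd.Series([10, 26, 26, 35, 60, 71])

# Chained rule: a new group starts only when the gap to the PREVIOUS value
# exceeds 20. Here 35 is chained onto 26, which is not what is wanted.
chained = (top.diff().gt(20).cumsum() + 1).tolist()
print(chained)   # [1, 1, 1, 1, 2, 2]

# Anchored rule (the one described in the note): a new group starts when a
# value is more than 20 above the FIRST value of the current group.
anchored = []
start = None
label = 0
for value in top:
    if start is None or value > start + 20:
        start = value
        label += 1
    anchored.append(label)
print(anchored)  # [1, 1, 1, 2, 3, 3]
```

Because each group boundary depends on where the previous group started, the anchored rule needs a sequential pass rather than a single vectorized `diff`.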

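Is there another way to solve this?

Edit:

I have a different use case in which the Top value keeps increasing; when the data moves to a new page, the Top value resets and starts increasing again. The same applies across different inputs. In the end I want to group by input file name, page number, and group. How can I do this grouping?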

{'Input File Name': {0: 268441,
  1: 268441,
  2: 268441,
  3: 268441,
  4: 268441,
  5: 268441,
  6: 268441,
  7: 268441,
  8: 268441,
  9: 268441,
  10: 268441,
  11: 268441,
  12: 268441,
  13: 268441,
  14: 268441,
  15: 268441,
  16: 268441,
  17: 268441,
  18: 268441,
  19: 268441,
  20: 268441,
  21: 268441,
  22: 268441,
  23: 268441,
  24: 268441,
  25: 268441,
  26: 268441,
  27: 268441,
  28: 268441,
  29: 268441,
  30: 268441,
  31: 268441,
  32: 268441,
  33: 268441,
  34: 268441,
  35: 268441,
  36: 268441,
  37: 268441,
  38: 268441,
  39: 268441},
 'Page Number': {0: 1,
  1: 1,
  2: 1,
  3: 1,
  4: 1,
  5: 1,
  6: 1,
  7: 1,
  8: 1,
  9: 1,
  10: 1,
  11: 1,
  12: 1,
  13: 1,
  14: 1,
  15: 1,
  16: 1,
  17: 1,
  18: 1,
  19: 1,
  20: 2,
  21: 2,
  22: 2,
  23: 2,
  24: 2,
  25: 2,
  26: 2,
  27: 2,
  28: 2,
  29: 2,
  30: 2,
  31: 2,
  32: 2,
  33: 2,
  34: 2,
  35: 2,
  36: 2,
  37: 2,
  38: 2,
  39: 2},
 'Content': {0: '3708 Forestview Road',
  1: 'AvailableForLease&Sale',
  2: '1,700± SFMedicalOffice',
  3: '3708ForestviewRoad',
  4: 'Suite107',
  5: 'Raleigh,NC27612',
  6: 'BuildingDescription',
  7: '22,278± SFClassAOfficeBuilding',
  8: 'OnlyOneSuiteLeft toLeaseand/orPurchase',
  9: '(1)1,700± SFShell',
  10: 'FlexibleLeaseTerms',
  11: '2Floorsw/Elevator&Stairsto2',
  12: 'Level',
  13: 'nd',
  14: 'ClassAFinishes',
  15: 'On-SitePropertyManagement',
  16: 'LargeGlass Windows',
  17: '5:1Parking',
  18: 'Formoreinformation,contact:',
  19: 'OtherTenants: PivotPhysicalTherapy,TheLundy',
  20: 'LeasingDetails',
  21: 'SpaceDescription',
  22: 'LeaseRate',
  23: 'CompetitiveNNN+$5.50TICAM',
  24: 'Tenant',
  25: 'Suite107:1,700± SF',
  26: 'Janitorial&Electric',
  27: 'Responsibilities',
  28: 'ShellSpacew/TIAllowance&Architecturals',
  29: 'ClassABuilding',
  30: 'SalePrice',
  31: '$374,000or$220PSF',
  32: 'BeautifulDouble-DoorEntry',
  33: '1,700',
  34: '± SF',
  35: 'Size',
  36: 'LargeGlassWindows',
  37: 'ColdDarkShellw/TIAllowance',
  38: '5:1Parking',
  39: 'Upfit'},
 'Top': {0: 6,
  1: 6,
  2: 49,
  3: 103,
  4: 103,
  5: 103,
  6: 590,
  7: 637,
  8: 656,
  9: 676,
  10: 695,
  11: 716,
  12: 716,
  13: 717,
  14: 736,
  15: 755,
  16: 775,
  17: 794,
  18: 813,
  19: 835,
  20: 111,
  21: 138,
  22: 142,
  23: 142,
  24: 169,
  25: 174,
  26: 179,
  27: 190,
  28: 195,
  29: 216,
  30: 217,
  31: 217,
  32: 238,
  33: 247,
  34: 247,
  35: 248,
  36: 259,
  37: 274,
  38: 282,
  39: 285}}

Tags: python, pandas

Solution


You can write a function that first assigns a group label to each value in the Top column, and then use groupby on that label column:

import pandas as pd

df = pd.DataFrame({'Name': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F'},
 'ID': {0: 100, 1: 23, 2: 19, 3: 42, 4: 11, 5: 78},
 'Left': {0: 70, 1: 70, 2: 70, 3: 70, 4: 66, 5: 66},
 'Top': {0: 10, 1: 26, 2: 26, 3: 35, 4: 60, 5: 71}})

def group(values, group_range):
    """Return a 1-based group label for each value.

    A new group starts whenever a value is more than group_range above
    the first value of the current group. Assumes the values are sorted
    in ascending order.
    """
    labels = []
    group_count = 0
    current_group_start = None
    for a in values:
        # Start a new group for the first value, or when the value falls
        # outside the range anchored at the current group's first value.
        if current_group_start is None or a > current_group_start + group_range:
            group_count += 1
            current_group_start = a
        labels.append(group_count)
    return labels
# group(df['Top'], 20) -> [1, 1, 1, 2, 3, 3]

df['group'] = group(df['Top'], 20)
df.groupby(['group'], as_index=False).agg(list)

Output:

    group   ID              Left            Name        Top
0   1       [100, 23, 19]   [70, 70, 70]    [A, B, C]   [10, 26, 26]
1   2       [42]            [70]            [D]         [35]
2   3       [11, 78]        [66, 66]        [E, F]      [60, 71]
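
For the follow-up in the question's edit (grouping by input file name, page number, and group), one possible approach is to restart the labelling for every file/page pair and then group on all three columns. The following is a minimal sketch under some assumptions: `label_groups` is a hypothetical helper that applies the same anchored rule as above, the range is taken to be 20, Top is assumed to be sorted in ascending order within each page, and only a few rows of the question's second data set are used.

```python
import pandas as pd

def label_groups(values, group_range):
    """Hypothetical helper: 1-based labels, where a new group starts when a
    value is more than group_range above the current group's first value."""
    labels, start, count = [], None, 0
    for v in values:
        if start is None or v > start + group_range:
            start, count = v, count + 1
        labels.append(count)
    return labels

# A few rows taken from the question's second data set (pages 1 and 2).
df = pd.DataFrame({
    'Input File Name': [268441] * 10,
    'Page Number': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    'Content': ['3708 Forestview Road', 'AvailableForLease&Sale',
                '1,700± SFMedicalOffice', '3708ForestviewRoad', 'Suite107',
                'Raleigh,NC27612', 'LeasingDetails', 'SpaceDescription',
                'LeaseRate', 'CompetitiveNNN+$5.50TICAM'],
    'Top': [6, 6, 49, 103, 103, 103, 111, 138, 142, 142],
})

keys = ['Input File Name', 'Page Number']

# Label Top separately for each (file, page) pair so numbering restarts per page.
df['group'] = df.groupby(keys)['Top'].transform(
    lambda s: pd.Series(label_groups(s, 20), index=s.index)
)
# df['group'] -> [1, 1, 2, 3, 3, 3, 1, 2, 2, 2]

# Collect the remaining columns into lists per file, page, and group.
result = df.groupby(keys + ['group'], as_index=False).agg(list)
print(result[['Page Number', 'group', 'Top']])
```

If Top is not already sorted within a page, sorting with `df.sort_values(keys + ['Top'])` before labelling would be needed for the anchored rule to make sense.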
