Group by a column and concatenate unique string values from multiple columns into a single column

Problem description

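I have a very large DataFrame. I want to group by the "NAME" column and concatenate the unique string values from the ID, ID2, ID3, and ID4 columns into a single new column.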

I tried:

df.groupby('NAME').apply(lambda x: x['ID'] + x['ID2'] + x['ID3'] + x['ID4'])

I have also tried df.groupby.agg with multiple lambda functions, but I still need a way to use pd.unique with string values.

NAME   Type    ID     ID2     ID3     ID4    MEMBERSHIP
Capg   Active  778    535     667     898     Global
Capg   Active  778    835     100     444     Blue
Capg   Active  778    535     667     898     Black
Doy   Active  246     8989    667     777     Silver
Doy   Active  266     8989    900     777     Silver
Doy   Active  266     8989    900     777     Silver
Art   Active  778     135     888     007     White
Art   Active  778     135     888     007     Silver
Art   Active  778     135     888     008     White
Art   Active  778     135     888     007     White

Desired output:

NAME   Type    ID     ID2     ID3     ID4    MEMBERSHIP  MERGED_IDS
Capg   Active  778    535     667     898     Global     778, 535, 667, 898, 835, 100
Capg   Active  778    835     100     444     Blue       778, 535, 667, 898, 835, 100
Capg   Active  778    535     667     898     Black      778, 535, 667, 898, 835, 100
Doy   Active  246     8989    667     777     Silver     246, 8989, 667, 777, 266, 900
Doy   Active  266     8989    900     777     Silver     246, 8989, 667, 777, 266, 900
Doy   Active  266     8989    900     777     Silver     246, 8989, 667, 777, 266, 900
Art   Active  778     135     888     007     White      778, 135, 888, 007, 008
Art   Active  778     135     888     007     Silver     778, 135, 888, 007, 008
Art   Active  778     135     888     008     White      778, 135, 888, 007, 008
Art   Active  778     135     888     007     White      778, 135, 888, 007, 008

Tags: python, pandas-groupby, group-concat

Solution


You can convert the ID columns to sets and then take their union:

import io
import pandas as pd

data = """NAME   Type    ID     ID2     ID3     ID4    MEMBERSHIP
Capg   Active  778    535     667     898     Global
Capg   Active  778    835     100     444     Blue
Capg   Active  778    535     667     898     Black
Doy   Active  246     8989    667     777     Silver
Doy   Active  266     8989    900     777     Silver
Doy   Active  266     8989    900     777     Silver
Art   Active  778     135     888     007     White
Art   Active  778     135     888     007     Silver
Art   Active  778     135     888     008     White
Art   Active  778     135     888     007     White"""

df = pd.read_csv(io.StringIO(data), sep=' ', skipinitialspace=True, dtype=str)

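def group_IDs(x):
    # Union of the unique values found in the four ID columns of one group
    return set(x['ID']) | set(x['ID2']) | set(x['ID3']) | set(x['ID4'])

# One set of IDs per NAME, returned as a Series indexed by NAME
grouped = df.groupby("NAME").apply(group_IDs)
grouped.name = "MERGED_IDS"
# Attach each group's set to every row that has that NAME
df = df.merge(grouped, left_on='NAME', right_index=True)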

This adds a MERGED_IDS column holding the set of unique IDs for each NAME group.
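The set-based approach returns an unordered Python set rather than the comma-separated string shown in the desired output. If the string format and first-appearance order matter, the following is a minimal sketch of one way to get them, not the original answerer's method. The helper name merge_ids and the id_cols list are hypothetical names introduced here, and a shortened copy of the sample data is used. Note that 444 appears in the Capg result because it is present in the input, even though the desired output in the question omits it.

```python
import io

import pandas as pd

# Shortened copy of the sample data from the question
data = """NAME Type ID ID2 ID3 ID4 MEMBERSHIP
Capg Active 778 535 667 898 Global
Capg Active 778 835 100 444 Blue
Capg Active 778 535 667 898 Black
Doy Active 246 8989 667 777 Silver
Doy Active 266 8989 900 777 Silver
Doy Active 266 8989 900 777 Silver
Art Active 778 135 888 007 White
Art Active 778 135 888 007 Silver
Art Active 778 135 888 008 White
Art Active 778 135 888 007 White"""

# dtype=str keeps leading zeros such as "007"
df = pd.read_csv(io.StringIO(data), sep=r"\s+", dtype=str)

# Hypothetical name: the columns whose values should be merged
id_cols = ["ID", "ID2", "ID3", "ID4"]


def merge_ids(group):
    # Flatten the group's ID values row by row, then drop repeats;
    # pd.unique preserves the order of first appearance
    values = group[id_cols].to_numpy().ravel()
    return ", ".join(pd.unique(values))


# One string per NAME, indexed by NAME
merged = df.groupby("NAME", sort=False)[id_cols].apply(merge_ids)

# Broadcast each group's string back onto every row of that group
df["MERGED_IDS"] = df["NAME"].map(merged)

print(df[["NAME", "MERGED_IDS"]].drop_duplicates())
```

Mapping the per-group Series back through df["NAME"] avoids a merge and leaves the original row order untouched.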

