Merging rows into a single cell (as a tuple) for specific columns

Problem description

Suppose I have a table like the one below. For example, the row under id 2 only shows the status and date history for that id.

 id      status            date          Type
 2       dissolved         2016/03/19    T1
 nan     active            NaT           nan
 3       dissolved         2016/03/19    T3
 nan     active            2012/03/16    nan
 4       in liquidation    2017/03/19    T2
 nan     dissolved         NaT           nan

What I want to do is combine the rows by id, so that for the first id = 2, for example, I get:

 id      status                 date                  Type
 2       [dissolved, active]    [2016/03/19, None]    T1

I tried:

 data.groupby(['id']).agg(lambda x: tuple(x)).applymap(list).reset_index() 

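But this sets the id column to [2, nan] when all I want is 2. How can I do this? I only want the status and date values collected into lists, not everything lumped together!

Input df: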

import pandas as pd

df = pd.DataFrame([[2, "dissolved", "2016/03/19", "T1"],
                   [float("nan"), "active", "NaT", float("nan")],
                   [3, "dissolved", "2016/03/19", "T3"],
                   [float("nan"), "active", "2012/03/16", float("nan")],
                   [4, "in liquidation", "2017/03/19", "T2"],
                   [float("nan"), "dissolved", "NaT", float("nan")]],
                  columns=["id", "status", "date", "Type"])

Tags: python, pandas, row

Solution


Try the following:

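import pandas as pd

df = pd.DataFrame([[2, "dissolved", "2016/03/19", "T1"],
                   [float("nan"), "active", "NaT", float("nan")],
                   [3, "dissolved", "2016/03/19", "T3"],
                   [float("nan"), "active", "2012/03/16", float("nan")],
                   [4, "in liquidation", "2017/03/19", "T2"],
                   [float("nan"), "dissolved", "NaT", float("nan")]],
                  columns=["id", "status", "date", "Type"])
# Forward-fill so each continuation row inherits the id and Type of the row above it.
df = df.ffill()
# Append a separator so the concatenated strings can be split apart again later.
df["status"] = df["status"] + ","
df["date"] = df["date"] + ","
# Summing string columns concatenates them within each (id, Type) group.
df2 = df.groupby(["id", "Type"]).sum()
# Split on the separator and drop the empty piece left by the trailing comma.
df2["status"] = df2["status"].apply(lambda x: x.split(",")[:-1])
df2["date"] = df2["date"].apply(lambda x: x.split(",")[:-1])
df2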

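Depending on your dataset, you may need to adjust this a little; for example, it relies on no status or date value containing a comma. It outputs the following: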

          status                        date
id  Type        
2.0 T1   [dissolved, active]           [2016/03/19, NaT]
3.0 T3   [dissolved, active]           [2016/03/19, 2012/03/16]
4.0 T2   [in liquidation, dissolved]   [2017/03/19, NaT]
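
As a possible alternative to the string concatenation above, the same grouping can be sketched by aggregating directly into lists. This is only a sketch under a few assumptions: the continuation rows always follow the row that carries their id, only the id and Type columns should be forward-filled, and the ids are whole numbers that can safely be cast to int. The name result is chosen here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame([[2, "dissolved", "2016/03/19", "T1"],
                   [float("nan"), "active", "NaT", float("nan")],
                   [3, "dissolved", "2016/03/19", "T3"],
                   [float("nan"), "active", "2012/03/16", float("nan")],
                   [4, "in liquidation", "2017/03/19", "T2"],
                   [float("nan"), "dissolved", "NaT", float("nan")]],
                  columns=["id", "status", "date", "Type"])

# Forward-fill only the key columns; status and date are left untouched.
df[["id", "Type"]] = df[["id", "Type"]].ffill()
# Assumption: ids are whole numbers, so the float column can be cast back to int.
df["id"] = df["id"].astype(int)

# Collect status and date into Python lists, one row per (id, Type) pair.
result = (
    df.groupby(["id", "Type"], as_index=False)
      .agg({"status": list, "date": list})
)
print(result)
```

Because the values are gathered with list rather than joined and split again, this variant does not depend on a separator character, and id stays a plain column instead of becoming [2, nan]. Note that "NaT" is still an ordinary string in this input, so it appears in the date lists as-is.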
