How to combine two rows into one under multiple rules in Python

Problem description

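In a single run of the code, I try to combine multiple pairs of rows. The rules for when two rows can be combined are illustrated by my example below.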

My example is:

import pandas as pd
df0 = pd.DataFrame({'ID':['1001','1002','1003','1004','2001','2002','2003','2004','3001','3002','3003','3004','4001','4002','4003','4004','5001','5002','5003','5004','6001'],
                   'PT':['B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','B','D','D','D','D','F'],
                   'DS':['AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAB','AAA','AAA','AAA','AAB','AAB'],
                   'SC':['P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P1','P2','P2','P2','P2','P1','P1','P1','P2','P2'],
                   'FS':['2020-10-16 00:00:00','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:04','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:02','2020-10-16 00:00:03','2020-10-16 00:00:00','2020-10-16 00:00:01','2020-10-16 00:00:05','2020-10-16 00:00:07','2020-10-16 00:00:01','2020-10-16 00:00:10','2020-10-16 00:10:00','2020-10-16 00:10:40','2020-10-16 00:00:00','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
                   'WT':[1,2,3,4,10,11,12,13,20,21,22,23,30,31,32,33,40,41,42,43,53],
                   'CB':[0.1,0.2,0.3,0.4,1,1.1,1.2,1.3,2,2.1,2.2,2.3,3,3.1,3.2,3.3,4,4.1,4.2,4.3,5.3]})

After running the code once, the new DataFrame df1 is:

df1 = pd.DataFrame({'ID':['1001,1002','1003,1004','2001,2002','2003,2004','3001,3002','3003,3004','4001,4002','4003,4004','5001,5002','5003','5004','6001'],
                   'PT':['B','B','B','B','B','B','B','B','D','D','D','F'],
                   'DS':['AAA','AAA','AAA','AAA','AAB','AAB','AAB','AAB','AAA','AAA','AAB','AAB'],
                   'SC':['P1','P1','P2','P2','P1','P1','P2','P2','P1','P1','P2','P2'],
                   'FS':['2020-10-16 00:00:02','2020-10-16 00:00:04','2020-10-16 00:00:01','2020-10-16 00:00:03','2020-10-16 00:00:01','2020-10-16 00:00:07','2020-10-16 00:00:10','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:40','2020-10-16 00:00:10','2020-10-16 00:00:05'],
                   'WT':[3,7,21,25,41,45,61,65,81,42,43,53],
                   'CB':[0.3,0.7,2.1,2.5,4.1,4.5,6.1,6.5,8.1,4.2,4.3,5.3]})

Running the code again on df1 gives the new DataFrame df2:

df2 = pd.DataFrame({'ID':['1001,1002,1003,1004','2001,2002,2003,2004','3001,3002,3003,3004','4001,4002,4003,4004','5001,5002,5003','5004','6001'],
                   'PT':['B','B','B','B','D','D','F'],
                   'DS':['AAA','AAA','AAB','AAB','AAA','AAB','AAB'],
                   'SC':['P1','P2','P1','P2','P1','P2','P2'],
                   'FS':['2020-10-16 00:00:04','2020-10-16 00:00:03','2020-10-16 00:00:07','2020-10-16 00:10:40','2020-10-16 00:10:00','2020-10-16 00:00:10','2020-10-16 00:00:05'],
                   'WT':[10,46,86,126,123,43,53],
                   'CB':[1,4.6,8.6,12.6,12.3,4.3,5.3]})

No further merging is possible on df2, because no pair of rows satisfies the rules.

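The reason is that I have a memory limit and must reduce the size of the data without losing information. So I try to bundle together IDs that share the same features and are close to each other. I plan to run the code repeatedly until the memory problem goes away or no further combinations are possible.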

Tags: python, pandas, numpy, pandas-groupby

Solution


This is a good place to use a GroupBy operation.

My source is Wes McKinney's Python for Data Analysis.

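# Label every row with the comma-joined IDs of its (PT, DS, SC) group.
df0['ID'] = df0.groupby(['PT', 'DS', 'SC'])['ID'].transform(','.join)

# Most recent FS per group; 'YYYY-MM-DD HH:MM:SS' strings sort chronologically.
max_times = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index=False)['FS'].max()

# Select WT and CB explicitly so the string column FS is not included in the sum.
sums_WT_CB = df0.groupby(['ID', 'PT', 'DS', 'SC'], as_index=False)[['WT', 'CB']].sum()

df2 = pd.merge(max_times, sums_WT_CB, on=['ID', 'PT', 'DS', 'SC'])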
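
The same result can probably be reached with one aggregation instead of two groupbys and a merge. The following is only a sketch using pandas named aggregation; the frame `df` is a trimmed, illustrative subset of the question's df0, and the name `combined` is introduced here and is not part of the original answer.

```python
import pandas as pd

# Trimmed subset of the question's df0, used only for illustration.
df = pd.DataFrame({
    'ID': ['1001', '1002', '5001', '5002', '5003', '5004'],
    'PT': ['B', 'B', 'D', 'D', 'D', 'D'],
    'DS': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAB'],
    'SC': ['P1', 'P1', 'P1', 'P1', 'P1', 'P2'],
    'FS': ['2020-10-16 00:00:00', '2020-10-16 00:00:02',
           '2020-10-16 00:00:00', '2020-10-16 00:10:00',
           '2020-10-16 00:00:40', '2020-10-16 00:00:10'],
    'WT': [1, 2, 40, 41, 42, 43],
    'CB': [0.1, 0.2, 4.0, 4.1, 4.2, 4.3],
})

# One pass over the groups: join the IDs, keep the latest FS, sum WT and CB.
combined = (
    df.groupby(['PT', 'DS', 'SC'], as_index=False, sort=False)
      .agg(ID=('ID', ','.join),
           FS=('FS', 'max'),
           WT=('WT', 'sum'),
           CB=('CB', 'sum'))
)

# Restore the column order used in the question.
combined = combined[['ID', 'PT', 'DS', 'SC', 'FS', 'WT', 'CB']]
print(combined)
```

Because each output column names its own aggregation, a different rule for FS (for example 'min' instead of 'max') would be a one-word change.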

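This code keeps only the most recent time for each unique grouping of the columns you specified. If there are other requirements for the FS column, it will have to be modified.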
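
The code above collapses each group in a single step, whereas the question describes repeated passes that merge rows two at a time. A minimal sketch of such a pass follows, assuming two rows may be paired whenever they share PT, DS and SC and are adjacent in row order within that group; any further rule (for instance on how close the FS values must be) is not modelled here. The helper `combine_pairs_once`, the temporary column `_pair`, and the trimmed sample frame are all hypothetical names introduced for illustration.

```python
import pandas as pd

KEYS = ['PT', 'DS', 'SC']
COLS = ['ID', 'PT', 'DS', 'SC', 'FS', 'WT', 'CB']

def combine_pairs_once(df):
    """One pass: inside each (PT, DS, SC) group, merge rows two at a time."""
    # cumcount numbers rows 0, 1, 2, ... within each group;
    # integer division by 2 gives rows (0, 1), (2, 3), ... the same pair number.
    pair = df.groupby(KEYS).cumcount() // 2
    out = (
        df.assign(_pair=pair)
          .groupby(KEYS + ['_pair'], as_index=False, sort=False)
          .agg(ID=('ID', ','.join),
               FS=('FS', 'max'),
               WT=('WT', 'sum'),
               CB=('CB', 'sum'))
    )
    # Drop the helper column and restore the question's column order.
    return out[COLS]

# Trimmed subset of the question's df0, used only for illustration.
df = pd.DataFrame({
    'ID': ['1001', '1002', '5001', '5002', '5003', '5004'],
    'PT': ['B', 'B', 'D', 'D', 'D', 'D'],
    'DS': ['AAA', 'AAA', 'AAA', 'AAA', 'AAA', 'AAB'],
    'SC': ['P1', 'P1', 'P1', 'P1', 'P1', 'P2'],
    'FS': ['2020-10-16 00:00:00', '2020-10-16 00:00:02',
           '2020-10-16 00:00:00', '2020-10-16 00:10:00',
           '2020-10-16 00:00:40', '2020-10-16 00:00:10'],
    'WT': [1, 2, 40, 41, 42, 43],
    'CB': [0.1, 0.2, 4.0, 4.1, 4.2, 4.3],
})

first_pass = combine_pairs_once(df)

# Repeat until a pass no longer reduces the number of rows.
result, passes = df, 0
while True:
    nxt = combine_pairs_once(result)
    if len(nxt) == len(result):
        break
    result, passes = nxt, passes + 1

print(first_pass)
print(result)
```

In practice the loop could also stop as soon as the frame fits in memory, which is the stopping condition the question mentions.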

The code for joining the IDs comes from: Concatenate strings from several rows using Pandas groupby