Efficient way to use pandas groupby on a million records

Problem description

I have a dataframe that can be generated with the code below:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'subject_ID': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                    'colum': ['L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea',
                              'L3CreaDate', 'L3Crea', 'L1CreaDate', 'L1Crea',
                              'L2CreaDate', 'L2Crea'],
                    'dates': ['2016-10-30 00:00:00', 2.3, '2016-10-30 00:00:00', 2.5,
                              np.nan, np.nan, '2016-10-30 00:00:00', 12.3,
                              '2016-10-30 00:00:00', 12.3]})

I am trying to perform the operations below on this dataframe. The code itself works fine; the problem is the groupby statement. It is quick on the sample dataframe, but on the real data with more than 1 million records it just runs for a very long time:

    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)

After the groupby I get the count column shown in the output below, so that I can reject the records whose count is 0. There is logic behind dropping the NAs; it is not simply dropping every NA. If you want to understand it, please refer to this post: keep few NA and drop rest of the NA logic.
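
For intuition, here is a minimal demo (my own, not from the original post) of what groupby(...).transform('count') does: it counts the non-NA values in each group and broadcasts that count back onto every row of the group, so rows belonging to an all-NA group get 0 and can be rejected.

import numpy as np
import pandas as pd

# Hypothetical toy frame: group 'b' contains only NA values.
demo = pd.DataFrame({'g': ['a', 'a', 'b', 'b'],
                     'v': [1.0, np.nan, np.nan, np.nan]})

# Non-NA count per group, broadcast to each row of the group.
print(demo.groupby('g')['v'].transform('count').tolist())  # [1, 1, 0, 0]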

In the real data, one person can have more than 10,000 rows, so the full dataframe has over 1 million rows.

Is there a better, more efficient way to do the groupby or to get the count column?

[image: expected output showing the count column]

Tags: python, python-3.x, pandas, dataframe, pandas-groupby

Solution


The idea is to use a list comprehension for the split to improve performance, and instead of assigning the output to a new count column, use it directly for filtering; the integer extraction and the sorting are then done last, on the reduced frame:

df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()

df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
df2 = df2.sort_values(by=['subject_ID','col3'])
print (df2)
   subject_ID       colum                dates    col2  col3
0           1  L1CreaDate  2016-10-30 00:00:00  L1Crea     1
1           1      L1Crea                  2.3  L1Crea     1
2           1  L2CreaDate  2016-10-30 00:00:00  L2Crea     2
3           1      L2Crea                  2.5  L2Crea     2
6           2  L1CreaDate  2016-10-30 00:00:00  L1Crea     1
7           2      L1Crea                 12.3  L1Crea     1
8           2  L2CreaDate  2016-10-30 00:00:00  L2Crea     2
9           2      L2Crea                 12.3  L2Crea     2

If you get the error:

AttributeError: 'float' object has no attribute 'split'

it means there are probably missing values, so the list comprehension should be changed to skip them:

# x == x is False only for NaN, so missing values pass through as NaN
df2['col2'] = [x.split("Date")[0] if x == x else np.nan for x in df2['colum']]
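
A quick illustration of the x == x trick (my own example): NaN is the only value here that does not compare equal to itself, which makes the comparison a cheap inline "is not NaN" check inside the comprehension.

import numpy as np

# NaN never equals itself, so `x == x` is False exactly for missing values.
for x in ['L1CreaDate', np.nan]:
    print(x, x == x)   # prints: L1CreaDate True, then nan False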

Check the performance:

def new(df2):
    df2['col2'] = [x.split("Date")[0] for x in df2['colum']]
    df2 = df2[df2.groupby(['subject_ID','col2'])['dates'].transform('count').ne(0)].copy()
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    return df2.sort_values(by=['subject_ID','col3'])


def orig(df2):
    df2['col2'] = df2['colum'].str.split("Date").str[0]
    df2['col3'] = df2['col2'].str.extract(r'(\d+)', expand=True).astype(int)
    df2 = df2.sort_values(by=['subject_ID','col3'])
    df2['count'] = df2.groupby(['subject_ID','col2'])['dates'].transform(pd.Series.count)
    return df2[df2['count'].ne(0)]

In [195]: %timeit (orig(df2))
10.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [196]: %timeit (new(df2))
6.11 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
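
Note that these timings are on the 10-row sample. To check the gap at a scale closer to the real data, one could benchmark both functions on a larger synthetic frame, along these lines (a sketch; the repeated pattern and values are made up):

import numpy as np
import pandas as pd

# Tile the sample pattern to roughly 1 million rows:
# 100,000 subjects with 10 rows each.
n = 100_000
big = pd.DataFrame({
    'subject_ID': np.repeat(np.arange(n), 10),
    'colum': ['L1CreaDate', 'L1Crea', 'L2CreaDate', 'L2Crea',
              'L3CreaDate', 'L3Crea', 'L4CreaDate', 'L4Crea',
              'L5CreaDate', 'L5Crea'] * n,
    'dates': ['2016-10-30 00:00:00', 2.3, '2016-10-30 00:00:00', 2.5,
              np.nan, np.nan, '2016-10-30 00:00:00', 12.3,
              '2016-10-30 00:00:00', 12.3] * n,
})

# In IPython (copies, since both functions mutate their input):
# %timeit new(big.copy())
# %timeit orig(big.copy())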
