Need to optimize/avoid pandas .resample() in a groupby call (for 1.4k rows, need to get it down to under 60 seconds; currently over 160 seconds)

Problem description

I have time-series entries that need to be resampled. In the more extreme case, I imagine someone might generate 15 months of data, which tends to be about 1,300 records (roughly 5 location entries for every 2 metric entries). Resampled to 15-minute intervals, though, the full set is about 41,000 rows.

My data is currently fewer than a couple dozen columns wide, so that works out to roughly 20 columns * 40k rows ≈ 800k values to compute. It seems like I should be able to get the runtime well under 10 seconds. I've done an initial profile, and the bottleneck appears to be a pair of pandas methods I'm calling to do the resampling, and they are remarkably slow, to the point where I'm wondering whether something is wrong. Why is pandas resample so slow here?

This causes timeouts in Google Cloud Functions, which is what I need to avoid.

There are two sets of data: locations and metrics. Sample location data looks like this:

location    bar     girlfriends     grocers     home    lunch   park    relatives   work 
date    user                                
2018-01-01 00:00:01     0ce65715-4ec7-4ca2-aab0-323c57603277    0   0   0   1   0   0   0   0

Sample metric data looks like this:

user    date    app     app_id  metric
0   4fb488bc-aea0-4f1e-9bc8-d7a8382263ef    2018-01-01 01:30:43     app_2   c2bfd6fb-44bb-499d-8e53-4d5af522ad17    0.02
1   6ca1a9ce-8501-49f5-b7d9-70ac66331fdc    2018-01-01 04:14:59     app_2   c2bfd6fb-44bb-499d-8e53-4d5af522ad17    0.10

I need to merge these two subsets into a single ledger with a column for each location name and a column for each app. The app values are point samples, so I need to "connect the dots" between them. The location values are location-change events, so I need to keep repeating the same value until the next change event. Put together, it looks like this (a small sketch of the two fill rules follows the sample):

    app_1   app_2   user    bar     grocers     home    lunch   park    relatives   work
date                                        
2018-01-31 00:00:00     0.146250    0.256523    4fb488bc-aea0-4f1e-9bc8-d7a8382263ef    0   0   1   0   0   0   0
2018-01-31 00:15:00     0.146290    0.256562    4fb488bc-aea0-4f1e-9bc8-d7a8382263ef    0   0   0   0   0   0   1
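
To make those two rules concrete, here is a tiny sketch (made-up values and column names, not my real data) of the fill semantics I'm after: location columns hold their last change event forward, while app columns are interpolated between samples:

import pandas as pd
import numpy as np

# toy 15-minute grid with gaps in both kinds of columns
idx = pd.date_range('2018-01-31 00:00', periods=5, freq='15T')
df = pd.DataFrame({'home':  [1, np.nan, np.nan, 0, np.nan],
                   'app_2': [0.25, np.nan, np.nan, np.nan, 0.29]}, index=idx)

df['home'] = df['home'].ffill()                    # repeat the location until the next change event
df['app_2'] = df['app_2'].interpolate('linear')    # "connect the dots" between metric samples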

This code does the job, but it needs to be optimized. What is the weakest link here? I've added basic profiling:

import time
import numpy as np
import pandas as pd

start = time.time()

locDf = locationDf.copy()
locDf.set_index('date', inplace=True)

# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
    locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

aDf = appDf.copy()
aDf.set_index('date', inplace=True)

print("section1:", time.time() - start)

userLocAppDfs = {}
for user, a2_df in aDf.groupby('user'):    
    start = time.time()

    # per user, convert app data to 15m interval
    userDf = a2_df.resample('15T').agg('max')

    print("section2.1:", time.time() - start)
    start = time.time()

    # assign metric for each app to an app column for each app, per user   
    userDf.reset_index(inplace=True)
    userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'], values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')

    userDf['user'] = user

    userDf.reset_index(inplace=True)
    userDf.set_index('date', inplace=True)

    print("section2.2:", time.time() - start)
    start = time.time()

    # reapply 15m intervals now that we have new data per app
    userLocAppDfs[user] = userDf.resample('15T').agg('max')

    print("section2.3:", time.time() - start)
    start = time.time()

    # assign location data to location columns per location, creates a "1" at the 15m interval of the location change event in the location column created    
    loDf = locDfs[user]
    loDf.reset_index(inplace=True)
    loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
    loDf.reset_index(inplace=True)

    loDf.set_index('date', inplace=True)
    loDf.drop('user', axis=1, inplace=True)

    print("section2.4:", time.time() - start)
    start = time.time()

    # join the location crosstab columns with the app crosstab columns per user
    userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
    # convert from just "1" at each location change event followed by zeros, to "1" continuing until next location change
    userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
    userLocAppDfs[user]['user'].fillna(user, inplace=True)

    print("section2.5:", time.time() - start)
    start = time.time()

    for loc in locationDf[locationDf['user'] == user].location.unique():

        # fill location NaNs
        userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)

    print("section3:", time.time() - start)
    start = time.time()


    # fill app NaNs
    for app in a2_df['app'].unique():
        userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)       
        userLocAppDfs[user][app].fillna(value=0, inplace=True)

    print("section4:", time.time() - start)

Results:

section1: 41.67342448234558
section2.1: 11.441165685653687
section2.2: 0.020460128784179688
section2.3: 5.082422733306885
section2.4: 0.2675948143005371
section2.5: 40.296404123306274
section3: 0.0076410770416259766
section4: 0.0027387142181396484
section2.1: 11.567803621292114
section2.2: 0.02080368995666504
section2.3: 7.187351703643799
section2.4: 0.2625312805175781
section2.5: 40.669641733169556
section3: 0.0072269439697265625
section4: 0.00457453727722168
section2.1: 11.773712396621704
section2.2: 0.019629478454589844
section2.3: 6.996192693710327
section2.4: 0.2728455066680908
section2.5: 45.172399282455444
section3: 0.0071871280670166016
section4: 0.004514217376708984

Both of the "big" sections call resample(...).agg('max').

Notes:

I found this question from 12 months ago: Pandas groupby + resample first is really slow - since version 0.22; it seems that resample() inside a groupby is currently broken.
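
For reference, the workaround usually suggested for that issue is a single grouped 15-minute aggregation with pd.Grouper rather than calling .resample() once per user in a Python loop. A rough sketch (assuming locationDf has 'user' and datetime 'date' columns as in the code above, not yet profiled against my data):

import pandas as pd

# one grouped aggregation instead of the per-user resample loop in section 1
loc15 = (locationDf
         .groupby(['user', pd.Grouper(key='date', freq='15T')])
         .max())

# caveat: unlike .resample(), this may not emit empty 15-minute bins, so a
# per-user reindex/asfreq step could still be needed before back-filling
loc15 = loc15.groupby(level='user').bfill()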

Tags: python, pandas
