Applying a function on a DataFrame that depends on the previous row's value

Problem description

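I want to detect the maxima and minima of a time series, always looking only to the left. Looking to the right would mean looking into the future, because the series is analysed live. My approach:

It translates into code like this: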

import pandas as pd

timerange = pd.date_range(start='1/1/2018', end='1/31/2018')
data = [0, 1, 2, 3, 4, 2, 1, 0, -1, 0, 3, 2, 1, 1, 0.5, 0, 1, 2, 4, 5, 6, 7, 8, 4, -2, -4, 0, 5, 3, 2, 0]
timeseries = pd.DataFrame(index=timerange, data=data, columns=['Value'])

max = data[0]
min = data[0]
pct = .5
tendancy = False
for now in timeseries.index:

    value = timeseries.loc[now, 'Value']

    if value >= max:
        max = value
    if value <= min:
        min = value

    range = max-min

    # Cancel the previous max value when going up if the 50% rule is triggered
    if value >= min + range * pct and tendancy != 'up':
        tendancy = 'up'
        max = value
    # Cancel the previous min value when going down if the 50% rule is triggered
    elif value <= max - range * pct and tendancy != 'down':
        tendancy = 'down'
        min = value

    ratio = (value-min)/(max-min)

    timeseries.loc[now, 'Max'] = max
    timeseries.loc[now, 'Min'] = min
    timeseries.loc[now, 'Ratio'] = ratio

timeseries[['Value', 'Min', 'Max']].plot()
timeseries['Ratio'].plot(secondary_y=True)

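It works as expected: by looking at the Ratio variable, you know whether the series is currently defining a new low (0) or a new high (1), regardless of the amplitude or frequency of the signal.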

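However, on my real data (~200,000 rows), it is extremely slow. I am wondering whether there is a way to optimise this, in particular by using the DataFrame's .apply() method. But since the result depends on the previous row, I do not know whether that method is applicable.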

Tags: pandas

Solution


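A first, simple speed-up is to stop iterating over the index and accessing each row with loc, and instead iterate directly over the values and append the three results you want (max, min, ratio) to a list, for example: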

max_ = data[0] #NOTE: I rename the variables with _ to avoid using builtin method names
min_ = data[0]
pct = .5
tendancy = False
l_res = [] # list for the results
for value in timeseries['Value'].to_numpy(): #iterate over the values

    if value >= max_:
        max_ = value
    if value <= min_:
        min_ = value

    range_ = max_-min_

    # Cancel the previous max value when going up if the 50% rule is triggered
    if value >= min_ + range_ * pct and tendancy != 'up':
        tendancy = 'up'
        max_ = value
    # Cancel the previous min value when going down if the 50% rule is triggered
    elif value <= max_ - range_ * pct and tendancy != 'down':
        tendancy = 'down'
        min_ = value

    ratio = (value-min_)/(max_-min_)
    # append the three results in the list
    l_res.append([max_, min_, ratio])

# create the three columns outside of the loop
timeseries[['Max', 'Min','Ratio']] = pd.DataFrame(l_res, index=timeseries.index)

As for timing, I put both approaches into functions (yours as f_maxime and this one as f_ben), which gives:

%timeit f_maxime(timeseries)
# 16.4 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit f_ben(timeseries)
# 651 µs ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

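So this approach is about 25 times faster on this sample, and for 200K rows I think it should still be about 25 times faster. I also checked that the results are the same: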

(f_ben(timeseries).fillna(0) == f_maxime(timeseries).fillna(0)).all().all()
#True
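The bodies of f_maxime and f_ben are not shown above. As a rough sketch of how they might be written, the block below wraps both loops in functions that take the DataFrame and return a new one. The shared helper _update, the pct keyword argument and the explicit NaN guard for the 0/0 case on the first row are assumptions added for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def _update(value, max_, min_, tendancy, pct):
    """One step of the max/min/ratio rule from the question (hypothetical helper)."""
    if value >= max_:
        max_ = value
    if value <= min_:
        min_ = value
    range_ = max_ - min_
    # Cancel the previous max when going up if the pct rule is triggered
    if value >= min_ + range_ * pct and tendancy != 'up':
        tendancy = 'up'
        max_ = value
    # Cancel the previous min when going down if the pct rule is triggered
    elif value <= max_ - range_ * pct and tendancy != 'down':
        tendancy = 'down'
        min_ = value
    spread = max_ - min_
    # Assumption: return NaN explicitly instead of dividing 0 by 0
    ratio = (value - min_) / spread if spread != 0 else np.nan
    return max_, min_, tendancy, ratio


def f_maxime(df, pct=0.5):
    """Row-by-row version: reads and writes through .loc on every iteration."""
    out = df[['Value']].copy()
    max_ = min_ = out['Value'].iloc[0]
    tendancy = False
    for now in out.index:
        value = out.loc[now, 'Value']
        max_, min_, tendancy, ratio = _update(value, max_, min_, tendancy, pct)
        out.loc[now, 'Max'] = max_
        out.loc[now, 'Min'] = min_
        out.loc[now, 'Ratio'] = ratio
    return out


def f_ben(df, pct=0.5):
    """List-based version: iterates over raw values, builds the columns once."""
    values = df['Value'].to_numpy()
    max_ = min_ = values[0]
    tendancy = False
    l_res = []
    for value in values:
        max_, min_, tendancy, ratio = _update(value, max_, min_, tendancy, pct)
        l_res.append([max_, min_, ratio])
    res = pd.DataFrame(l_res, index=df.index, columns=['Max', 'Min', 'Ratio'])
    return pd.concat([df[['Value']], res], axis=1)


timerange = pd.date_range(start='1/1/2018', end='1/31/2018')
data = [0, 1, 2, 3, 4, 2, 1, 0, -1, 0, 3, 2, 1, 1, 0.5, 0, 1, 2, 4, 5, 6, 7,
        8, 4, -2, -4, 0, 5, 3, 2, 0]
timeseries = pd.DataFrame(index=timerange, data=data, columns=['Value'])

# Compare the two outputs numerically, treating NaN in the same place as equal
a = f_maxime(timeseries).to_numpy(dtype=float)
b = f_ben(timeseries).to_numpy(dtype=float)
same = bool(np.allclose(a, b, equal_nan=True))
print(same)
```

Comparing with np.allclose and equal_nan=True plays the same role as the fillna(0) in the check above: the first row has max equal to min, so its ratio is undefined.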

As for using apply, I do not think it would bring any benefit for speeding up the code in this case.
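To illustrate that point, here is a minimal sketch, not taken from the answer, of how apply could be made to carry state from one row to the next: a callable object (the hypothetical ExtremaTracker class) keeps max, min and the tendency between calls. Series.apply still calls this Python function once per element, so it amounts to the same loop with extra overhead rather than a vectorised operation. The NaN guard for the 0/0 case is an assumption added here, because the values that reach the function through apply may be plain Python floats, for which 0/0 raises an error.

```python
import math

import numpy as np
import pandas as pd


class ExtremaTracker:
    """Hypothetical helper: remembers max/min/tendency between calls."""

    def __init__(self, first_value, pct=0.5):
        self.max_ = self.min_ = first_value
        self.pct = pct
        self.tendancy = False

    def __call__(self, value):
        if value >= self.max_:
            self.max_ = value
        if value <= self.min_:
            self.min_ = value
        range_ = self.max_ - self.min_
        if value >= self.min_ + range_ * self.pct and self.tendancy != 'up':
            self.tendancy = 'up'
            self.max_ = value
        elif value <= self.max_ - range_ * self.pct and self.tendancy != 'down':
            self.tendancy = 'down'
            self.min_ = value
        spread = self.max_ - self.min_
        # Assumption: return NaN explicitly instead of dividing 0 by 0
        ratio = (value - self.min_) / spread if spread != 0 else math.nan
        return self.max_, self.min_, ratio


data = [0, 1, 2, 3, 4, 2, 1, 0, -1, 0, 3, 2, 1, 1, 0.5, 0, 1, 2, 4, 5, 6, 7,
        8, 4, -2, -4, 0, 5, 3, 2, 0]
values = pd.Series(data, name='Value',
                   index=pd.date_range(start='1/1/2018', end='1/31/2018'))

# Stateful apply: one Python call per element, in index order
via_apply = values.apply(ExtremaTracker(values.iloc[0]))

# Plain loop over the raw values with a fresh tracker
tracker = ExtremaTracker(values.iloc[0])
via_loop = [tracker(v) for v in values.to_numpy()]

# Both routes should produce the same (max, min, ratio) triples
a = np.array(via_apply.tolist(), dtype=float)
b = np.array(via_loop, dtype=float)
same = bool(np.allclose(a, b, equal_nan=True))
print(same)
```

Because the tracker is stateful, the result depends on the rows being visited in order and exactly once, which is another reason the explicit loop is the clearer choice here.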

