How to calculate an exponential moving average faster with Python?

Problem description

I am busy building backtesting software, but I have run into trouble creating an exponential moving average. I managed to build one with a for loop, but it takes about 20 seconds to run for every symbol I want to test (far too long).

I am trying to find a faster solution, so if anyone has any suggestions, please share.

My current code looks like this, but it does not produce the correct results.

def exponential_moving_average(df, period):
    # Work on a copy so the original dataframe stays untouched.
    dataframe = df.copy()

    # Recursive (adjust=False) EMA of the Close column; min_periods
    # masks the first `period - 1` values with NaN.
    dataframe['EMA'] = dataframe['Close'].ewm(span=period,
                                              adjust=False,
                                              min_periods=period,
                                              ignore_na=True).mean()

    return dataframe['EMA']

This method lives in an indicator class, and its input takes the following form.
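For context, a hypothetical standalone call (in the real code this is a method on the indicator class) might look like:

# Hypothetical driver code: `df` is the OHLC frame shown below,
# with 10 as the slow_ma period from the expected-results table.
df['slow_ma'] = exponential_moving_average(df, period=10)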

Here is a snippet of df with its values:

            symbol     Open     High      Low    Close  ATR     slow_ma
Date        
2010-01-03  EURUSD  1.43075  1.43369  1.43065  1.43247  NaN       NaN   
2010-01-04  EURUSD  1.43020  1.44560  1.42570  1.44120  NaN       NaN   
2010-01-05  EURUSD  1.44130  1.44840  1.43460  1.43650  NaN       NaN   
2010-01-06  EURUSD  1.43660  1.44350  1.42820  1.44060  NaN       NaN   
2010-01-07  EURUSD  1.44070  1.44470  1.42990  1.43070  NaN       NaN   
2010-01-08  EURUSD  1.43080  1.44380  1.42630  1.44160  NaN       NaN   
2010-01-10  EURUSD  1.44245  1.44252  1.44074  1.44110  NaN       NaN   
2010-01-11  EURUSD  1.44280  1.45560  1.44080  1.45120  NaN       NaN   
2010-01-12  EURUSD  1.45120  1.45450  1.44530  1.44840  NaN       NaN   
2010-01-13  EURUSD  1.44850  1.45790  1.44570  1.45100  NaN  1.442916   
2010-01-14  EURUSD  1.45090  1.45550  1.44460  1.44990  NaN  1.444186   
2010-01-15  EURUSD  1.45000  1.45110  1.43360  1.43790  NaN  1.443043   
2010-01-17  EURUSD  1.43597  1.43655  1.43445  1.43480  NaN  1.441544   
2010-01-18  EURUSD  1.43550  1.44000  1.43340  1.43830  NaN  1.440954   
2010-01-19  EURUSD  1.43820  1.44130  1.42520  1.42870  NaN  1.438726

And this is the expected result for slow_ma (10 days):

            symbol     Open     High      Low    Close  ATR     slow_ma
Date        
2010-01-03  EURUSD  1.43075  1.43369  1.43065  1.43247  NaN       NaN   
2010-01-04  EURUSD  1.43020  1.44560  1.42570  1.44120  NaN       NaN   
2010-01-05  EURUSD  1.44130  1.44840  1.43460  1.43650  NaN       NaN   
2010-01-06  EURUSD  1.43660  1.44350  1.42820  1.44060  NaN       NaN   
2010-01-07  EURUSD  1.44070  1.44470  1.42990  1.43070  NaN       NaN   
2010-01-08  EURUSD  1.43080  1.44380  1.42630  1.44160  NaN       NaN   
2010-01-10  EURUSD  1.44245  1.44252  1.44074  1.44110  NaN       NaN   
2010-01-11  EURUSD  1.44280  1.45560  1.44080  1.45120  NaN       NaN   
2010-01-12  EURUSD  1.45120  1.45450  1.44530  1.44840  NaN       NaN   
2010-01-13  EURUSD  1.44850  1.45790  1.44570  1.45100  NaN   1.44351   
2010-01-14  EURUSD  1.45090  1.45550  1.44460  1.44990  NaN   1.44467   
2010-01-15  EURUSD  1.45000  1.45110  1.43360  1.43790  NaN   1.44344   
2010-01-17  EURUSD  1.43597  1.43655  1.43445  1.43480  NaN   1.44187   
2010-01-18  EURUSD  1.43550  1.44000  1.43340  1.43830  NaN   1.44122   
2010-01-19  EURUSD  1.43820  1.44130  1.42520  1.42870  NaN   1.43894

I have changed the values in the first dataframe so that it shows the values used to calculate slow_ma.

This is my first post on Stack Overflow, so please ask if anything is unclear.

Tags: python, pandas, numpy, finance, quantitative-finance

Solution


How to calculate an exponential moving average with Python faster?

Speeds under ~ 50 [us] are achievable for your sized data / period, even on an old 2.6 [GHz] i5 device...


Step 0: Make the results (and the process) pass Quality Assurance

Fast but wrong data adds negative value, right?

Given that you are using the "hardwired" .ewm() method, the best you can do is re-read its parametrisation options and check whether a different processing mode for the dataframe['Close'] column is possible.
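As a quick illustration that this parametrisation really matters, here is a minimal sketch (only pandas assumed; the Close prices are the ones from your snippet) comparing the two adjust modes of .ewm():

import pandas as pd

close = pd.Series([ 1.43247, 1.44120, 1.43650, 1.44060, 1.43070,
                    1.44160, 1.44110, 1.45120, 1.44840, 1.45100 ])

# adjust=False: the plain recursion EMA[t] = (1-a)*EMA[t-1] + a*x[t]
ema_recursive = close.ewm(span=10, adjust=False).mean()

# adjust=True: weights are renormalised over the history seen so far
ema_adjusted = close.ewm(span=10, adjust=True).mean()

print((ema_recursive - ema_adjusted).abs().max())   # non-zero: the modes differ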

As a fast check:

aPV = [ 1.43247, # borrowed from dataframe['Close']
        1.44120,
        1.43650,
        1.44060, 1.43070, 1.44160, 1.44110, 1.45120, 1.44840,
        1.45100, 1.44990, 1.43790, 1.43480, 1.43830, 1.42870,
        ]
>>> QuantFX.numba_EMA_fromPrice2( N_period     = 10,
                                  aPriceVECTOR = QuantFX.np.array( aPV )
                                  )
array([ 
        1.43247   ,
        1.43405727,
        1.4345014 ,
        1.43561024,
        1.43471747,
        1.43596884,
        1.43690178,
        1.43950145,
        1.44111937,
        1.44291585,
        1.44418569,
        1.44304284,
        1.44154414,
        1.4409543 ,
        1.43872624
        ]
      )

for which there are some ~ +/- 3E-7 numerical-representation differences from the values in the first table above (i.e. about two orders of magnitude below the least significant digit shown).

>>> ( QuantFX.numba_EMA_fromPrice2( 10,
                                    QuantFX.np.array( aPV )
                                    )
    - QuantFX.np.array( slow_EMA_1 )   # values borrowed from Table 1 above
      )

array([             nan,
                    nan,
                    nan,
                    nan,
                    nan,
                    nan,
                    nan,
                    nan,
                    nan,
        -1.50656152e-07,
        -3.05082306e-07,
        -1.58703705e-07,
         1.42878787e-07,
         2.98719007e-07,
         2.44406460e-07
         ]
        )
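The QuantFX.numba_EMA_fromPrice2 helper itself is not reproduced in this post; a minimal sketch of what such a numba-JIT EMA can look like, assuming the standard recursive formulation with alpha = 2 / (N + 1) (the actual QuantFX implementation may differ), is:

import numpy as np
from numba import njit

@njit
def numba_EMA_fromPrice2(N_period, aPriceVECTOR):
    # Recursive EMA, seeded with the first price:
    #   EMA[0] = price[0]
    #   EMA[t] = alpha * price[t] + (1 - alpha) * EMA[t-1]
    alpha = 2.0 / (N_period + 1)
    ema = np.empty_like(aPriceVECTOR)
    ema[0] = aPriceVECTOR[0]
    for i in range(1, aPriceVECTOR.shape[0]):
        ema[i] = alpha * aPriceVECTOR[i] + (1.0 - alpha) * ema[i - 1]
    return ema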

Step 1: Tweak the ( QA-confirmed ) processing for better speed

During this phase, a lot depends on the wider context of use.

The best results could be expected from cythonize(), yet profiling may still show surprises along the way.

Without moving the processing into cython code, one can already get interesting speedups: using float64 globally instead of float32 shaved off some 110 ~ 200 [us] at similar EMA depths; vectorised in-place assignments gave a ~ 2x speedup (from ~ 100 [us] to ~ 50 [us]) by combining the allocation of the resulting vector with its vectorised value processing; and, best of all, a mathematical re-formulation can let you skip some purely "mechanical" operations altogether.
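One possible shape of such a re-formulation is the following hypothetical pure-numpy closed form of the adjust=False recursion, which replaces the python-level loop with a single cumulative sum (beware: the powers of (1 - alpha) can under- or overflow on very long vectors, where chunking or a compiled loop is safer):

import numpy as np

def ema_vectorised(data, window):
    # Closed form of EMA[t] = (1-a)*EMA[t-1] + a*x[t], loop-free.
    alpha = 2.0 / (window + 1)
    alpha_rev = 1.0 - alpha
    n = data.shape[0]
    pows = alpha_rev ** np.arange(n + 1)
    scale_arr = 1.0 / pows[:-1]            # (1-a)**-t
    offset = data[0] * pows[1:]            # x0 * (1-a)**(t+1)
    pw0 = alpha * alpha_rev ** (n - 1)
    cumsums = (data * pw0 * scale_arr).cumsum()
    return offset + cumsums * scale_arr[::-1]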

Yet all of the speedup tricks depend on the tools used: a pure numpy solution, numpy + numba (which may even have a negative effect on processing as trivial as the EMA in question, there being not much mathematical "meat" to actually number-crunch), or a cython-optimised solution. Profiling in the target CPU context is therefore a must if the best results are to be delivered.
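A minimal profiling sketch in that spirit (timeit-based; it assumes the two hypothetical implementations sketched above are in scope):

from timeit import timeit
import numpy as np

aPV_long = np.random.rand(1000) + 1.0     # synthetic price vector
_ = numba_EMA_fromPrice2(10, aPV_long)    # warm the numba JIT up first

for name, call in ( ( "numba EMA", lambda: numba_EMA_fromPrice2(10, aPV_long) ),
                    ( "numpy EMA", lambda: ema_vectorised(aPV_long, 10) ) ):
    per_call = timeit(call, number=1000) / 1000
    print("%s: %6.1f [us] per call" % (name, per_call * 1e6))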


trying to find a faster solution ...

It would be interesting if you updated your post with your expected target speedup, or better, a target per-call processing cost in the [TIME]-domain for the stated problem at the given [SPACE]-domain scale of data (window == 10, aPriceVECTOR.shape[0] ~ 15), and whether the target code-execution platform has any hardware / CPU / cache-hierarchy constraints, because building a backtester platform massively amplifies any and all code-design and code-execution inefficiencies.


Given the EMA itself is reasonably efficient, the right tools may still get you a ~ 4x speedup

The QuantFX story went from ~ 42000 [us] down to ~ 21000 [us] without any numba/JIT tools, just by re-formulating and memory-optimising the vector processing (using an artificially sized workload, processing a block of aPV[:10000]).

Next, the run-time dropped further, to ~ 10600 [us], using the as-is CPython code-base, simply by permitting auto-cythonisation, where possible, of the import-ed code via pyximport:

import pyximport
pyximport.install( pyimport = True )   # auto-cythonise subsequent imports

from QuantFX import numba_EMA_fromPrice2
...

So, speeds of ~ 45 ~ 47 [us] are achievable for your sized data (aPV[:15], period = 10) on an ordinary 2.6 [GHz] i5 device.


If you insist on using the pandas dataframe tools and methods, your performance is principally in the hands of the pandas team; there is not much to be done here about their design compromises, which had to be made given the ever-present dilemma between speed and universality.

