首页 > 解决方案 > R - 使用 nnetar 进行预测时的预测区间变窄

问题描述

我正在尝试使用预测包中的 nnetar 函数构建预测。与其他方法相比,我得到了一些相当不错的预测,但我在生成非常窄的预测间隔时遇到了麻烦。

我试图预测的数据集是来自电子商务的每周收入数据,其中转化率和广告支出作为解释性 x 变量 (Xreg)

这就是我的预测的样子: 在此处输入图像描述 这是我用来生成它的代码:

fit_test <- nnetar(total_revenue_ts, size = 5, repeats = 200, xreg = xreg)
fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, h=26)
autoplot(fit_test_fc) + autolayer(test_rev_ts$total)

这是我使用的数据:

total_revenue_ts <- structure(c(429527.84912, 5107534.789265, 5334742.992202, 7062236.076739, 
7376937.2329, 8843885.679834, 10312889.001099, 4743025.186331, 
1063820.467744, 8647610.570788, 7615849.910587, 6950888.592508, 
6858879.08066, 7207686.138817, 6552543.847104, 6286320.862515, 
6387758.212579, 6267651.456223, 6166523.577491, 6517987.757523, 
4032163.322867, 6774882.672302, 7280882.606489, 7042888.802793, 
5864325.907872, 7614073.472534, 5702820.168872, 5993043.498666, 
5748712.530684, 5781854.779949, 6514731.488613, 6200435.741256, 
6716691.630149, 5671091.670532, 6849896.078633, 6412725.445233, 
5820498.437424, 5140661.371894, 5543105.774292, 6498649.993838, 
6832579.992745, 6363471.54561, 5764986.861829, 6479827.767348, 
6082916.613222, 5654806.062709, 6250723.443025, 7021696.610899, 
6878521.38167, 6605964.840134, 5860880.924163, 6027383.028358, 
7271275.876805, 5788375.978398, 5952319.104294, 8700792.56985, 
9387497.556219, 10628335.699833, 12300448.541447, 7624816.545391, 
8041602.838183, 7340912.745611, 6475830.912185, 6511598.406008, 
7670675.084654, 6597851.103698, 5992838.357045, 5782002.308393, 
7591927.838791, 6316308.891923, 6024260.46223, 6099526.226113, 
5341138.559686, 5959177.962052, 4614361.675905, 5649334.049846, 
6774789.19439, 7823320.381864, 5941416.816392, 6576822.658397, 
4949544.168466, 6394315.633561, 5432101.434962, 5971872.77196, 
6375234.021085, 6776885.612781, 6381300.2023, 5376238.120971, 
4654630.262986, 5404870.534346, 6616177.722868, 6627152.023493, 
6566693.385556, 6687236.645467, 6473086.938295, 5478904.979073, 
5884130.390298, 6219789.15664), .Tsp = c(2015.84615384615, 2017.71153846154, 
52), class = "ts")

xreg <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 5723.69, 5528.78, 6099.31, 13001.39, 6750.07, 
6202.91, 6685.01, 5020, 5094.73, 2714.07, 9645.9, 8208.18, 6297.5, 
8053.29, 0, 4418.27, 9393.52, 11139.19, 12678.08, 12493.18, 11242.28, 
9617.09, 6959.37, 11716.52, 8464.61, 1499.14, 14538.86, 12080.69, 
11905.71, 14405.72, 9077.05, 10362.49, 13776.75, 17620.9, 14767.2, 
19511.98, 19747.72, 19885.44, 16810.46, 10618.04, 7494.02, 8166.45, 
7503.29, 7955.54, 7971.87, 14520.84, 19219.74, 18824.67, 27216.48, 
32030.82, 32007.76, 24153.88, 20472.33, 17617.01, 4.77806579193501, 
5.7287751461365, 5.28098389402001, 5.02434789173682, 4.95184840012426, 
5.64277441770686, 5.37984870432963, 5.3906432267007, 5.43849275673932, 
5.6884135855546, 5.2709838799333, 5.41075942817439, 4.94887009216008, 
4.95521307717827, 5.62734185585188, 5.51042637235732, 5.29807054480431, 
5.52756845275268, 5.70969961018115, 5.54781801299907, 5.73014260988972, 
5.99759204482959, 6.22750289793179, 5.93356463634027, 5.69127817622951, 
5.57154841999638, 5.66114857960352, 5.72923212265929, 5.31293510374571, 
5.35736716492903, 5.65568332596196, 5.74619318262752, 5.5954764989987, 
5.34701430785202, 5.38617886178862, 6.0341348094332, 5.46323395671082, 
5.33899929707969, 5.22135801253651, 5.65190410869423, 5.28112320474013, 
4.80649483723496, 4.81842452314323, 5.00675102835432, 4.49345845605863, 
3.82212461085761, 4.62551440329218, 3.79930173346953, 5.71101883613167, 
6.40135958079592, 7.1027311558873, 4.0456548762572, 4.86275624624909, 
3.68451118002285, 5.40269725877529, 5.24419134903069, 5.0344951706761, 
4.89131058216232, 5.63214154072982, 5.52286515754452, 4.99781361730586, 
5.09012974090091, 5.43346256247373, 5.20251523559131, 5.25889558131295, 
4.17869474160865, 5.59036205822923, 5.33376848927069, 5.38868363783592, 
5.43341024859593, 5.19857108205253, 5.19137882047327, 5.23814895237021, 
5.01957530659338, 5.48137535816619, 5.67523044227311, 5.26029025707068, 
5.18449109254837, 5.24915583751151, 5.45151430953043, 5.34584086799277, 
4.97336938233212, 5.22618004090631, 5.52619366814479, 5.70389182510811, 
5.75578084064244, 5.53339664450776, 5.16303263313334, 5.88409835642594, 
5.56461936196381, 5.20891730381574, 5.21675833063733, 5.30279468609766, 
5.22628072593614, 4.77056025260184, 4.72482746416563, 4.68623694730198, 
5.07214098963881), .Dim = c(98L, 2L), .Dimnames = list(NULL, 
    c("adCost", "transactionsPerSession")), .Tsp = c(2015.84615384615, 
2017.71153846154, 52), class = c("mts", "ts", "matrix"))

xreg_test <- structure(c(17617.01, 13526.88, 14836.89, 20358.16, 20416.79, 
21635.72, 15456.3, 12569.27, 18673, 20591.58, 18922.52, 19658.27, 
21371.37, 20921.06, 18846.68, 17315.48, 18569.47, 20276.32, 17932.33, 
18405.48, 17566.76, 15605.29, 18694.58, 17082.73, 18291.26, 18211.78, 
18252.98, 5.07214098963881, 4.9644513137558, 4.50735617759714, 
3.42940249666707, 5.57244242550868, 6.85297018333131, 8.27499041424656, 
5.64773791252811, 4.17746355274814, 4.78132627344352, 4.5212649754887, 
4.16629173040583, 3.95132622368061, 4.2603550295858, 4.07247936849659, 
3.98828918165935, 3.8364837584878, 4.32967453511229, 4.10479719434903, 
3.88986772076209, 3.89750505731625, 4.02224223511425, 4.23119830350054, 
3.54885240337703, 4.05530730967035, 4.46043036568541, 4.59654125314768
), .Dim = c(27L, 2L), .Dimnames = list(NULL, c("adCost", "transactionsPerSession"
)), .Tsp = c(2017.71153846154, 2018.21153846154, 52), class = c("mts", 
"ts", "matrix"))

test_rev_ts$total <- structure(c(6219789.15664, 6207675.91913, 5375609.354946, 5970907.816396, 
4905889.954914, 6003436.003269, 6311734.743992, 5771009.21678, 
5284469.645259, 7228321.956032, 7070364.421462, 8978263.238038, 
11173150.908703, 8212310.181272, 5336736.750351, 6918492.690826, 
7807812.156676, 7025220.106499, 6539795.925754, 6734049.267568, 
6736165.004623, 5775402.314813, 6083716.578991, 6441420.211984, 
6269669.541568, 4968476.314634, 11122809.394872), .Tsp = c(2017.71153846154, 
2018.21153846154, 52), class = "ts")

如果有人能解释为什么我得到如此窄的预测间隔以及如何解决它,我将不胜感激。

标签: rtime-seriesforecasting

解决方案


为什么预测区间这么窄?

默认情况下,nnetar将来自样本内残差的信息用于预测区间中使用的创新。根据模型的复杂性,残差可以任意小。文档给出了这个警告

请注意,如果网络过于复杂并且过拟合数据,则残差可以任意小;如果用于预测区间计算,它们可能会导致误导性的小值。

与此相关的是,您的时间序列有 98 个点,模型有 31 个参数。此外,数据的季节性周期为 52,当使用季节性滞后时,您实际上只有 46 个数据点可以拟合。

作为参考,残差的标准差nnetar大约是残差的 4 倍auto.arima


如何处理狭窄的预测区间?

有几种可能性。为了加快这些示例的计算,我从您的示例中减少了拟合模型的数量 (to repeats = 50) 和 PI 模拟的数量 (to npaths = 50)。为了排除这些变化和 RNG 的影响,请考虑以下模型作为基线:

set.seed(1234)
fit_test <- nnetar(total_revenue_ts, size = 5, repeats = 50, xreg = xreg)
fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, npaths = 50)
autoplot(fit_test_fc) + autolayer(test_rev_ts)

在此处输入图像描述

  • 提供更好的创新以forecast供使用

    这些将影响间隔,但平均预测将保持不变。

    • 手动设置创新

      如果您对更合适的创新有一些外部知识可以使用,您可以innov在预测时通过参数提供它们。

      例如,假设您碰巧知道创新的标准差实际上应该比残差显示的值大 3 倍。然后你可以这样做:

      set.seed(1234) fit_test <- nnetar(total_revenue_ts, size = 5, repeats
      = 50, xreg = xreg)
      ## Set up new innovations for PI 
      res_sd <- sd(residuals(fit_test), na.rm=T) 
      myinnovs <- rnorm(nrow(xreg_test)*50, mean=0, sd=res_sd*3)
      ## fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, npaths = 50, innov = myinnovs) 
      autoplot(fit_test_fc) + autolayer(test_rev_ts)
      

      在此处输入图像描述

    • 使用样本外值

      您可以通过使用样本外残差而不是样本内残差来估计更好的创新。subsetin 参数 nnetar允许您仅拟合部分数据。您还可以使用该CVar函数进行交叉验证并从那里获取残差。这是使用后者的示例:

      set.seed(1234) 
      fit_test <- nnetar(total_revenue_ts, size = 5, repeats = 50, xreg = xreg)
      ## Set up new innovations for PI 
      fit_test_cv < CVar(total_revenue_ts,  size = 5, repeats = 50, xreg = xreg) 
      res_sd <- sd(fit_test_cv$residuals, na.rm=T) 
      myinnovs <- rnorm(nrow(xreg_test)*50, mean=0, sd=res_sd)
      ## 
      fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = reg_test, npaths = 50, innov = myinnovs) 
      autoplot(fit_test_fc) + autolayer(test_rev_ts)
      

      在此处输入图像描述

  • 过拟合控制

    除了预测区间之外,这些修改还会影响您的模型,因此平均预测与基线相比会发生变化。

    • 消除季节性滞后

      您的数据大约有 2 个季节性周期。使用季节性滞后会使您损失很大一部分,因为模型需要滞后值进行拟合和预测。您可以删除季节性成分,也许会增加额外的滞后来补偿。在下面的示例中,通过有更多的滞后,我将参数的数量增加到 36 个,但由于没有季节性滞后,“获得”了 49 分。

      set.seed(1234)
      fit_test <- nnetar(total_revenue_ts, p=3, P=0, size = 5, repeats = 50, xreg = xreg)
      fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, npaths = 50)
      autoplot(fit_test_fc) + autolayer(test_rev_ts)
      

      在此处输入图像描述

    • 降低模型复杂度

      如前所述,5 个神经元有 31 个参数。例如,size=2将这个数字减少到 13 个。

      set.seed(1234)
      fit_test <- nnetar(total_revenue_ts, size = 2, repeats = 50, xreg = xreg)
      fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, npaths = 50)
      autoplot(fit_test_fc) + autolayer(test_rev_ts)
      

      在此处输入图像描述

    • 使用正则化

      为了补偿模型的复杂性,我们可以使用decayfrom 的参数nnet进行正则化。

      set.seed(1234)
      fit_test <- nnetar(total_revenue_ts, size = 5, repeats = 50, xreg = xreg, decay = 1)
      fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, npaths = 50)
      autoplot(fit_test_fc) + autolayer(test_rev_ts)
      

      在此处输入图像描述


底线

如果适合您的使用,也可以组合其中的几个选项,但归根结底,重要的是要记住这些是复杂的模型,您只能用大约 100 个数据点做很多事情。

下面是结合样本外残差的正则化模型的样子:

set.seed(1234)
fit_test <- nnetar(total_revenue_ts, size = 5, repeats = 50, xreg = xreg, decay = 0.1)
## Set up new innovations for PI
fit_test_cv <- CVar(total_revenue_ts,  size = 5, repeats = 50, xreg = xreg, decay = 0.1)
res_sd <- sd(fit_test_cv$residuals, na.rm=T)
myinnovs <- rnorm(nrow(xreg_test)*50, mean=0, sd=res_sd)
##
fit_test_fc <- forecast(fit_test, PI=TRUE , xreg = xreg_test, npaths = 50, innov = myinnovs)
autoplot(fit_test_fc) + autolayer(test_rev_ts)

在此处输入图像描述

**请注意,在示例中,我假设创新的正态分布并且只改变标准偏差,但是当通过innov参数手动添加它们时,它们也可以遵循任何其他任意分布。


推荐阅读