Training time series models in caret by group

Problem description

I have a dataset like the following:

library(data.table)
library(caret)

set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
                  y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
                  x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
                  x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
                  x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
                  x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
                  x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE))

# Number the observations within each group: this is the time index
foo[, period := 1:.N, by = group]

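Question: I want a one-step-ahead forecast of y for each group, using the variables x1, ..., x5.

I want to run several models in caret to decide which one I will use.

At the moment I run them in a loop, using time slices: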

window.length <- 115
timecontrol   <- trainControl(method            = 'timeslice',
                              initialWindow     = window.length,
                              horizon           = 1,
                              selectionFunction = "best",
                              fixedWindow       = TRUE,
                              savePredictions   = 'final')

model_list <- list()
for (g in unique(foo$group)) {
  # Keep one group's rows and drop the identifier columns
  dat <- foo[group == g][, c('group', 'period') := NULL]
  for (model in c("xgbTree", "earth", "cubist")) {
    model_list[[g]][[model]] <- train(y ~ . - 1,
                                      data      = dat,
                                      method    = model,
                                      trControl = timecontrol)
  }
}

However, I would like to run all groups at once, using dummy variables to identify each group, like this:

dat <- cbind(foo,  model.matrix(~ group- 1, foo))
            y         x1       x2       x3            x4 x5 period groupA groupB groupC groupD groupE groupF
  1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04  1      1      1      0      0      0      0      0
  2: 3.442213  8.6558983 32.41881 45.70801  3.255423e-01  1      1      0      1      0      0      0      0
  3: 3.485286  7.7295448 21.99022 56.42133  8.668391e+00  1      1      0      0      1      0      0      0
  4: 9.659601  0.9166456 30.34609 55.72661 -7.666063e+00  1      1      0      0      0      1      0      0
  5: 5.567950  3.0306864 22.07813 52.21099  5.377153e-01  1      1      0      0      0      0      1      0

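but still run the timeslice resampling in the correct time order.

Is there a way to declare a time variable in trainControl, so that my one-step-ahead forecast in this case uses 6 more observations in each round and drops the first 6 observations?

I can do this by sorting the data and fiddling with the horizon argument (given n groups, sort by the time variable and put horizon = n), but this has to be changed whenever the number of groups changes. And initialWindow has to be time * n_groups: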

timecontrol   <- trainControl(method          = 'timeslice',
                            initialWindow     = window.length * length(unique(foo$group)),
                            horizon           = length(unique(foo$group)), 
                            selectionFunction = "best",
                            fixedWindow       = TRUE, 
                            savePredictions   = 'final')

Is there another way?

Tags: r, time-series, r-caret, training-data

Solution


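I think the answer you are looking for is actually quite simple. You can use the skip argument of trainControl() to thin the time slices so that each new training/test set starts one whole period later. skip counts the slices dropped between the ones that are kept, so with n groups you need skip = n - 1 for the window to advance n rows at a time. This way, each group-period is predicted only once, the same period is never split between the training and test sets, and there is no information leakage.

Using the example you provided, with the rows ordered by period, if you set skip = 5 (the number of groups minus one), horizon = 6 (the number of groups) and initialWindow = 115 * 6, then the first test set will include all groups for period 116, the next test set will include all groups for period 117, and so on.

library(caret)
library(tidyverse)

set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
              y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
              x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
              x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
              x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
              x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
              x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE)) %>%
  group_by(group) %>%
  mutate(period = row_number()) %>%
  ungroup() %>%
  # timeslice works on row positions, so rows must be ordered by period
  arrange(period, group)

# One dummy column per group replaces the group label.
# period stays in dat as a predictor; use select(-group, -period)
# to match the per-group loop in the question.
dat <- cbind(foo, model.matrix(~ group - 1, foo)) %>%
  select(-group)

window.length <- 115
n_groups      <- length(unique(foo$group))

timecontrol   <- trainControl(
  method            = 'timeslice',
  initialWindow     = window.length * n_groups,  # 115 full periods
  horizon           = n_groups,                  # one full period ahead
  skip              = n_groups - 1,              # keep every n_groups-th slice
  selectionFunction = "best",
  fixedWindow       = TRUE,
  savePredictions   = 'final'
)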
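
Before fitting anything, it may be worth checking that the slices line up with periods. The sketch below builds them directly with caret::createTimeSlices(), which takes the same initialWindow, horizon, fixedWindow and skip arguments; as far as I know, train() uses this function for method = 'timeslice'. It assumes rows ordered by period and then group, as in the example, and that skip = k keeps every (k+1)-th slice. The names n_periods, slices and periods_per_test are mine, introduced only for illustration.

```r
library(caret)

n_groups      <- 6
n_periods     <- 150
window.length <- 115

# Period of each row when rows are ordered by period, then group
period <- rep(seq_len(n_periods), each = n_groups)

slices <- createTimeSlices(
  seq_along(period),
  initialWindow = window.length * n_groups,
  horizon       = n_groups,
  fixedWindow   = TRUE,
  skip          = n_groups - 1   # assumption: keeps every n_groups-th slice
)

# Number of distinct periods in each test slice; aligned slices give 1
periods_per_test <- vapply(slices$test,
                           function(i) length(unique(period[i])),
                           integer(1))
table(periods_per_test)

# Periods forecast by the first two slices
unique(period[slices$test[[1]]])
unique(period[slices$test[[2]]])

# The last training period should sit right before the test period
gap_ok <- vapply(seq_along(slices$test), function(k) {
  max(period[slices$train[[k]]]) + 1 == min(period[slices$test[[k]]])
}, logical(1))
all(gap_ok)
```

If periods_per_test contains anything other than 1, some test slice straddles two periods, and skip or the row order needs another look.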

model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
            ~ train(
              y ~ . - 1,
              data = dat,
              method = .x,
              trControl = timecontrol
            )) %>% 
  set_names(model_names)
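
Since the goal in the question is to pick one model, the fitted objects can then be compared on the same time slices. The following is a minimal sketch, not part of the original answer: it uses a small toy data set and two cheap methods ("knn" and "rpart") so that it runs quickly, and the names toy, dat_toy, ctrl, toy_fits and rmse_by_group are hypothetical. It relies on caret's resamples() to collect the per-slice metrics and on the pred element saved by savePredictions = 'final' to compute an error per group.

```r
library(caret)

set.seed(503)
n_groups  <- 3
n_periods <- 40

# Toy panel, rows ordered by period and then group
toy <- data.frame(
  period = rep(seq_len(n_periods), each = n_groups),
  group  = rep(LETTERS[seq_len(n_groups)], times = n_periods),
  x1     = rnorm(n_groups * n_periods)
)
toy$y <- 5 + 0.5 * toy$x1 + rnorm(nrow(toy))

# Outcome, predictor and one dummy column per group
dat_toy <- cbind(toy[, c("y", "x1")], model.matrix(~ group - 1, toy))

ctrl <- trainControl(
  method          = "timeslice",
  initialWindow   = 30 * n_groups,   # 30 full periods
  horizon         = n_groups,        # one full period ahead
  skip            = n_groups - 1,    # advance one period per slice
  fixedWindow     = TRUE,
  savePredictions = "final"
)

toy_models <- c("knn", "rpart")
toy_fits <- lapply(toy_models, function(m) {
  train(y ~ . - 1, data = dat_toy, method = m, trControl = ctrl)
})
names(toy_fits) <- toy_models

# Side-by-side resampling metrics across the shared time slices
summary(resamples(toy_fits))

# Out-of-sample RMSE per group for one model, via the saved row indices
pred <- toy_fits$knn$pred
pred$group <- toy$group[pred$rowIndex]
rmse_by_group <- sapply(split(pred, pred$group),
                        function(d) sqrt(mean((d$obs - d$pred)^2)))
rmse_by_group
```

With only a few rows per test slice, the per-slice R-squared can be undefined and caret may warn about missing values in the resampled metrics, so RMSE or MAE is probably the more useful column to compare.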
