Recursive lag() during mutate?

Problem description

An alternative title might be "Using lag within mutate to refer to the previous row's mutated value".

I want to use the value generated for the previous row as an input to the mutate calculation for the current row. Some data:

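library(dplyr)    # mutate(), group_by(), lag(), row_number()
library(ggplot2)  # provides the diamonds dataset

mydiamonds <- diamonds %>%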
  mutate(Ideal = ifelse(cut == 'Ideal', 1, 0)) %>% 
  group_by(Ideal) %>% 
  mutate(rn = row_number()) %>% 
  arrange(Ideal, rn) %>% 
  mutate(CumPrice = cumsum(price)) %>% 
  mutate(InitialPrice = min(price)) %>% 
  select(Ideal, rn, CumPrice, InitialPrice)

It looks like this:

mydiamonds %>% head
# A tibble: 6 x 4
# Groups:   Ideal [1]
  Ideal    rn CumPrice InitialPrice
  <dbl> <int>    <int>        <int>
1     0     1      326          326
2     0     2      653          326
3     0     3      987          326
4     0     4     1322          326
5     0     5     1658          326
6     0     6     1994          326

A model:

mod.diamonds = glm(CumPrice ~ log(lag(CumPrice)) +log(rn) + Ideal , family = "poisson", data = mydiamonds)

Testing the model:

# new data, pretend we don't know CumPrice but want to use predictions to predict subsequent predictions
mydiamonds.testdata <- mydiamonds %>% select(-CumPrice)
# manual prediction based on lag(prediction), for the first row in each group use InitialPrice
## add coefficients as fields
coeffs <- mod.diamonds$coefficients
mydiamonds.testdata <- mydiamonds.testdata %>% 
  mutate(CoefIntercept = coeffs['(Intercept)'],
         CoefLogLagCumPrice = coeffs['log(lag(CumPrice))'],
         CoefLogRn = coeffs['log(rn)'],
         CoefIdeal = coeffs['Ideal']
         )

This is what my test data looks like:

 mydiamonds.testdata %>% head
# A tibble: 6 x 7
# Groups:   Ideal [1]
  Ideal    rn InitialPrice CoefIntercept CoefLogLagCumPrice CoefLogRn CoefIdeal
  <dbl> <int>        <int>         <dbl>              <dbl>     <dbl>     <dbl>
1     0     1          326        0.0931              0.987    0.0154 -0.000715
2     0     2          326        0.0931              0.987    0.0154 -0.000715
3     0     3          326        0.0931              0.987    0.0154 -0.000715
4     0     4          326        0.0931              0.987    0.0154 -0.000715
5     0     5          326        0.0931              0.987    0.0154 -0.000715
6     0     6          326        0.0931              0.987    0.0154 -0.000715

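I can't use predict(), because I need to predict recursively: the prediction for the previous day/row feeds into the prediction for the current one. Instead, I tried predicting manually from the coefficients: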

# prediction
mydiamonds.testdata <- mydiamonds.testdata %>% 
  mutate(
    Prediction = CoefIntercept + 
      
      # here's the hard bit. If it's the first row in the group, use InitialPrice, else use the value of the previous prediction
      (CoefLogLagCumPrice * ifelse(rn == 1, InitialPrice, lag(Prediction))) + 
      
      (CoefLogRn * log(rn)) + 
      (CoefIdeal * Ideal)
    )

Error: Problem with `mutate()` input `Prediction`.
x object 'Prediction' not found
ℹ Input `Prediction` is `+...`.
ℹ The error occurred in group 1: Ideal = 0.

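How can I mutate in this way, where each row refers to the previous row's mutated value? (Unless it is the first row, in which case InitialPrice should be used.)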

[Edit] Following a commenter's suggestion, I gave accumulate a try. It is a function I'm not very familiar with:

mydiamonds.testdata <- mydiamonds.testdata %>% 
  mutate(
    Prediction = accumulate(.f = function(.) {
      
    .$CoefIntercept + 
      
      # here's the hard bit. If it's the first row in the group, use InitialPrice, else use the value of the previous prediction
      (.$CoefLogLagCumPrice * ifelse(.$rn == 1, .$InitialPrice, lag(.$Prediction))) + 
      
      (.$CoefLogRn * log(.$rn)) + 
      (.$CoefIdeal * .$Ideal)
      
      }))
Error: Problem with `mutate()` input `Prediction`.
x argument ".x" is missing, with no default
ℹ Input `Prediction` is `accumulate(...)`.
ℹ The error occurred in group 1: Ideal = 0.

Tags: r, dplyr

Solution


Since you said you are not used to this rather complex function, here is a bit of explanation.

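purrr::accumulate() is used to compute row-by-row recursive operations. Its first argument, .x, is the variable you want to accumulate over. Its second argument, .f, is a function that should take two arguments: the current result cur and the next value to evaluate val. The first time .f is called, cur is equal to .x[1] (by default); after that, it is equal to the previous result returned by .f.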
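As a minimal sketch of the cur/val mechanics described above (the vector x and the two update functions are illustrative, not taken from the question):

```r
library(purrr)

x <- c(1, 2, 3, 4)

# .f receives the running result (cur) and the next element of .x (val).
# The first output is x[1] itself; .f is first called with cur = 1, val = 2.
running_sum <- accumulate(x, function(cur, val) cur + val)
# 1 3 6 10

# A recursive update in which each result depends on the previous result
decayed <- accumulate(x, function(cur, val) 0.5 * cur + val)
# 1 2.5 4.25 6.125
```

The second call has the same shape as the recursive prediction: every output is a function of the previous output plus something that depends on the current row.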

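purrr::accumulate2() lets us iterate over a second variable, .y, as well. No element of .y is paired with .x[1], because .x[1] is returned as the first result without calling .f. Therefore, .y should be one element shorter than .x (unless .init is supplied).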
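A small sketch of how the lengths of .x and .y line up, with and without .init. The inputs are made up for illustration, and unlist() is applied because accumulate2() may return a list:

```r
library(purrr)

x <- c(1, 2, 3)
y <- c(10, 20)  # one element shorter than x: nothing is paired with x[1]

# .f takes three arguments: running result, next .x value, matching .y value
no_init <- unlist(accumulate2(x, y, function(cur, val, w) cur + val * w))
# 1 21 81

# With .init, .f is also called for x[1], so .y must be as long as .x,
# and the result has one more element than .x
with_init <- unlist(
  accumulate2(x, c(5, 10, 20), function(cur, val, w) cur + val * w, .init = 0)
)
# 0 5 25 85
```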

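Unfortunately, there are only accumulate() and accumulate2(), whereas you would need something like an accumulate3() or a paccumulate() (neither of which exists in purrr) to accumulate over rn, Ideal and Price.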

However, by using row_number() and cur_data_all(), you can make accumulate2() behave the way you want:

library(purrr)  # accumulate2()

CoefIntercept = coeffs['(Intercept)']
CoefLogLagCumPrice = coeffs['log(lag(CumPrice))']
CoefLogRn = coeffs['log(rn)']
CoefIdeal = coeffs['Ideal']

mydiamonds.testdata <- mydiamonds %>% 
  ungroup() %>% 
  select(-CumPrice) %>% 
  mutate(
    Prediction = accumulate2(.x=InitialPrice, .y=row_number()[-1], 
                             .f=function(acc, nxt, row) {
      # acc: previous prediction; nxt: next InitialPrice value (unused); row: current row index
      db=cur_data_all()
      rn = db$rn[row]
      Ideal = db$Ideal[row]
      CoefIntercept +
        (CoefLogLagCumPrice * acc) +
        (CoefLogRn * log(rn)) +
        (CoefIdeal * Ideal)
      
    }) %>% unlist()
  )
mydiamonds.testdata

# A tibble: 53,940 x 4
#     Ideal    rn InitialPrice Prediction
#     <dbl> <int>        <int>      <dbl>
# 1       0     1          326       326 
# 2       0     2          326       322.
# 3       0     3          326       318.
# 4       0     4          326       313.
# 5       0     5          326       309.
# 6       0     6          326       305.
# 7       0     7          326       301.
# 8       0     8          326       297.
# 9       0     9          326       294.
# 10      0    10          326       290.

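Edit: there is another, neater way using the .init argument, since the InitialPrice column is never really used apart from its first value. This lets .f receive the covariates directly as arguments, but it won't work for more complex models with more covariates.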

mydiamonds.testdata <- mydiamonds %>% 
  ungroup() %>% 
  select(-CumPrice) %>% 
  mutate(
    Prediction = accumulate2(.x=Ideal[-1], .y=rn[-1], 
                             .init=InitialPrice[1],
                             .f=function(rslt, Ideal, rn) {
      CoefIntercept +
        (CoefLogLagCumPrice * rslt) +
        (CoefLogRn * log(rn)) +
        (CoefIdeal * Ideal)
      
    }) %>% unlist()
  )
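To make the recursion explicit, here is a minimal sketch that compares the .init approach with a plain for loop on a toy input. The coefficient values are the rounded ones from the test-data printout above; the five-row input and the helper name step are assumptions made for illustration only.

```r
library(purrr)

# Rounded coefficients copied from the test-data printout
coef_intercept <- 0.0931
coef_lag       <- 0.987
coef_log_rn    <- 0.0154
coef_ideal     <- -0.000715

# Toy inputs (hypothetical): a single group of five rows
rn            <- 1:5
ideal         <- rep(0, 5)
initial_price <- 326

# One recursive step: previous prediction plus the current row's covariates
step <- function(prev, ideal, rn) {
  coef_intercept + coef_lag * prev + coef_log_rn * log(rn) + coef_ideal * ideal
}

# accumulate2() version, mirroring the .init approach
pred_acc <- unlist(accumulate2(ideal[-1], rn[-1], step, .init = initial_price))

# Equivalent explicit loop
pred_loop <- numeric(length(rn))
pred_loop[1] <- initial_price
for (i in seq_along(rn)[-1]) {
  pred_loop[i] <- step(pred_loop[i - 1], ideal[i], rn[i])
}
```

Both vectors should agree element by element. Note that the snippets above call ungroup(), so the recursion runs over all rows as one sequence. If the recursion should restart within each group, keeping group_by(Ideal) in place of ungroup() should make Ideal[-1], rn[-1] and InitialPrice[1] refer to the current group, although that variant is not tested here.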
