Impute a constant and create a missing-value dummy

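Question

A common strategy for handling missing predictors in a regression is to create a dummy variable that flags the missing values and to fill the missing entries with a constant.

For example: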

lm(Y ~ X1 + replace(X2, is.na(X2), 0) + is.na(X2), df)

Is there a better way to do this?

In particular, if X3, X4, etc. also have missing values, this gets tedious, and I end up with an unwieldy formula like this:

Y ~ X1 + replace(X2, is.na(X2), 0) + is.na(X2) + 
         replace(X3, is.na(X3), 0) + is.na(X3) + 
         replace(X4, is.na(X4), 0) + is.na(X4)

It would also be nice to be able to impute the column mean instead of zero.

Data:

df <- structure(list(Y = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51, 
17.76, 9.42, 15.88, 27.81), X1 = 1:10, X2 = c(2L, NA, NA, 4L, 
8L, 7L, 6L, 1L, 3L, 9L)), .Names = c("Y", "X1", "X2"), row.names = c(NA, 
-10L), class = "data.frame")

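Tags: r, regression

Answer

One approach is to write a function that imputes the values and creates the dummy variables, perhaps something like this: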

impvars <-  function(dat) {
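  # Detect and impute: a column with NAs becomes a two-column matrix
  # (mean-imputed values, missingness indicator). lapply() always returns
  # a list, which do.call(cbind, ...) below requires; sapply() could
  # simplify to a matrix when no column (or every column) has NAs.
  imp <- lapply(dat, function(x) {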
    if (any(is.na(x))) {
      cbind(replace(x, is.na(x), mean(x, na.rm = TRUE)), is.na(x))
    }
    else {
      x
    }
  })

  rdf <- data.frame(do.call(cbind, imp))

  # Name the columns
  midx <- sapply(dat, function(x) any(is.na(x)))
  vnames <- names(dat)
  for (i in rev(seq_along(midx))) {
    if (midx[i])
      vnames <-
        append(vnames, paste0(vnames[i], "_dum"), after = i)
  }
  names(rdf) <- vnames

  return(rdf)

}

lm(Y ~ ., data = impvars(df))

Call:
lm(formula = Y ~ ., data = impvars(df))

Coefficients:
(Intercept)           X1       X1_dum           X2       X2_dum  
     0.3167       0.9622      -0.2523       2.0030       5.5531 

Data (here X1 also contains missing values):

df <- structure(list(Y = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51,
                           17.76, 9.42, 15.88, 27.81),
                     X1 = c(1:5, NA, NA, 8:10),
                     X2 = c(2L, NA, NA, 4L, 8L, 7L, 6L, 1L, 3L, 9L)),
                .Names = c("Y", "X1", "X2"), row.names = c(NA, -10L),
                class = "data.frame")
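
Since the question also asks about choosing between zero and the column mean, here is a minimal sketch of a variant that takes the fill rule as an argument. The function name add_missing_dummies and its fill and suffix parameters are hypothetical, not part of the answer above. It assumes all columns are numeric and that the response itself has no missing values.

```r
# Hypothetical helper (not from the original answer): for every column with
# missing values, fill the NAs and add a 0/1 indicator column right after it.
add_missing_dummies <- function(dat,
                                fill = function(x) mean(x, na.rm = TRUE),
                                suffix = "_dum") {
  out <- list()
  for (nm in names(dat)) {
    x <- dat[[nm]]
    miss <- is.na(x)
    if (any(miss)) {
      x[miss] <- fill(x)                          # replace NAs with the fill value
      out[[nm]] <- x
      out[[paste0(nm, suffix)]] <- as.integer(miss)  # 1 where the value was missing
    } else {
      out[[nm]] <- x                              # complete columns pass through
    }
  }
  as.data.frame(out)
}

# Same data as above, with missing values in both X1 and X2
df <- data.frame(
  Y  = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51, 17.76, 9.42, 15.88, 27.81),
  X1 = c(1:5, NA, NA, 8:10),
  X2 = c(2L, NA, NA, 4L, 8L, 7L, 6L, 1L, 3L, 9L)
)

df_mean <- add_missing_dummies(df)                        # mean imputation
df_zero <- add_missing_dummies(df, fill = function(x) 0)  # zero imputation

fit_mean <- lm(Y ~ ., data = df_mean)
fit_zero <- lm(Y ~ ., data = df_zero)
```

Because each imputed variable enters the model together with its own dummy, the choice of fill constant only shifts the dummy coefficients. The slopes on X1 and X2 are the same in fit_mean and fit_zero.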
