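r - Impute a constant and create a missing-value dummy
Problem description
A common strategy for handling missing predictors in a regression is to create a dummy variable that flags the missing values and to fill the gaps with a constant.
For example: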
lm(Y ~ X1 + replace(X2, is.na(X2), 0) + is.na(X2), df)
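Is there a better way to do this?
In particular, if X3, X4, etc. also have missing values, this gets very tedious, and I end up with the following unwieldy formula: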
Y ~ X1 + replace(X2, is.na(X2), 0) + is.na(X2) +
  replace(X3, is.na(X3), 0) + is.na(X3) +
  replace(X4, is.na(X4), 0) + is.na(X4)
It would also be nice to be able to impute the column mean instead of zero.
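For a single variable, swapping the constant is a small change. The sketch below fits both versions on the sample data from this question (repeated here so the snippet runs on its own). Because the indicator is in the model, the choice of constant should only shift the dummy's coefficient; the fitted values and the other coefficients are expected to agree. The names fit_zero and fit_mean are just illustrative.

```r
# Sample data from the question, rebuilt with data.frame()
df <- data.frame(
  Y  = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51, 17.76, 9.42, 15.88, 27.81),
  X1 = 1:10,
  X2 = c(2L, NA, NA, 4L, 8L, 7L, 6L, 1L, 3L, 9L)
)

# Zero-fill plus missingness indicator (the formula from the question)
fit_zero <- lm(Y ~ X1 + replace(X2, is.na(X2), 0) + is.na(X2), df)

# Same model, but filling with the mean of the observed X2 values
fit_mean <- lm(Y ~ X1 + replace(X2, is.na(X2), mean(X2, na.rm = TRUE)) + is.na(X2), df)

# Intercept, X1 and X2 coefficients of the two fits, for comparison
coef_zero <- unname(coef(fit_zero)[1:3])
coef_mean <- unname(coef(fit_mean)[1:3])
```

Neither fit drops any rows, since no term in the formula evaluates to NA.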
Data:
df <- structure(list(Y = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51,
17.76, 9.42, 15.88, 27.81), X1 = 1:10, X2 = c(2L, NA, NA, 4L,
8L, 7L, 6L, 1L, 3L, 9L)), .Names = c("Y", "X1", "X2"), row.names = c(NA,
-10L), class = "data.frame")
Solution
One approach is to use a function that imputes the values and creates the dummy variables, perhaps something like this:
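impvars <- function(dat) {
  # Detect columns with NAs; impute the column mean and add an indicator.
  # lapply (rather than sapply) always returns a list, so the cbind below
  # also works when every column, or no column, contains NAs.
  imp <- lapply(dat, function(x) {
    if (any(is.na(x))) {
      cbind(replace(x, is.na(x), mean(x, na.rm = TRUE)), is.na(x))
    } else {
      x
    }
  })
  rdf <- data.frame(do.call(cbind, imp))
  # Name the columns: insert "<name>_dum" right after each imputed column
  midx <- sapply(dat, function(x) any(is.na(x)))
  vnames <- names(dat)
  for (i in rev(seq_along(midx))) {
    if (midx[i])
      vnames <- append(vnames, paste0(vnames[i], "_dum"), after = i)
  }
  names(rdf) <- vnames
  return(rdf)
}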
lm(Y ~ ., data = impvars(df))
Call:
lm(formula = Y ~ ., data = impvars(df))

Coefficients:
(Intercept)           X1       X1_dum           X2       X2_dum
     0.3167       0.9622      -0.2523       2.0030       5.5531
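Note that impvars() treats every column the same way, so a response with missing values would be imputed too, and the fill value is always the mean. A possible variant is sketched below. It is only an illustration: impute_with_dummies, response and fill are hypothetical names, and it assumes all predictors are numeric.

```r
# Hypothetical variant of impvars(): skips the response, accepts a fill rule
impute_with_dummies <- function(dat, response,
                                fill = function(x) mean(x, na.rm = TRUE)) {
  out <- dat[response]                       # response is copied unchanged
  for (v in setdiff(names(dat), response)) {
    x <- dat[[v]]
    miss <- is.na(x)
    if (any(miss)) {
      out[[v]] <- replace(x, miss, fill(x))  # fill the gaps
      out[[paste0(v, "_dum")]] <- as.integer(miss)  # 1 = value was missing
    } else {
      out[[v]] <- x
    }
  }
  out
}

# Same values as the data used in this answer (X1 and X2 both have NAs)
df <- data.frame(
  Y  = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51, 17.76, 9.42, 15.88, 27.81),
  X1 = c(1:5, NA, NA, 8:10),
  X2 = c(2L, NA, NA, 4L, 8L, 7L, 6L, 1L, 3L, 9L)
)

imp_mean <- impute_with_dummies(df, "Y")                        # column means
imp_zero <- impute_with_dummies(df, "Y", fill = function(x) 0)  # constant 0
fit <- lm(Y ~ ., data = imp_mean)
```

Passing fill as a function keeps the mean, the median or a fixed constant behind one argument; whether that flexibility is needed depends on the use case.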
Data (X1 now also contains missing values):
df <- structure(list(Y = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51,
17.76, 9.42, 15.88, 27.81), X1 = c(1:5, NA, NA, 8:10), X2 = c(2L, NA, NA, 4L,
8L, 7L, 6L, 1L, 3L, 9L)), .Names = c("Y", "X1", "X2"), row.names = c(NA,
-10L), class = "data.frame")
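If the data frame should stay untouched, the zero-fill formula from the question can also be generated instead of typed by hand. The following is a minimal sketch, assuming syntactic column names; na_formula is a hypothetical helper, not part of any package, and it relies on reformulate() from the stats package.

```r
# Same values as the data above, rebuilt with data.frame()
df <- data.frame(
  Y  = c(3.83, 22.73, 13.85, 14.09, 20.55, 18.51, 17.76, 9.42, 15.88, 27.81),
  X1 = c(1:5, NA, NA, 8:10),
  X2 = c(2L, NA, NA, 4L, 8L, 7L, 6L, 1L, 3L, 9L)
)

# Hypothetical helper: one plain term per complete predictor, and a
# zero-fill term plus an is.na() indicator per predictor with NAs
na_formula <- function(dat, response) {
  preds <- setdiff(names(dat), response)
  terms <- unlist(lapply(preds, function(v) {
    if (anyNA(dat[[v]])) {
      c(sprintf("replace(%s, is.na(%s), 0)", v, v), sprintf("is.na(%s)", v))
    } else {
      v
    }
  }))
  reformulate(terms, response = response)
}

f <- na_formula(df, "Y")
fit <- lm(f, data = df)
```

Here f has the same shape as the hand-written formula, with one replace()/is.na() pair for each predictor that has missing values.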