r - 逐步比较系数
问题描述
我用mtcars
数据来解释我的问题。例如,我试图估计变量作为因变量的回归系数cyl
,mpg
并通过包含其他其他变量来评估系数的变化。
第 1 步:lm(mpg ~ cyl, data = df)
得到粗系数cyl
第 2 步:在第 1 步中将其他每个变量一次添加到模型中,选择系数变化 (%) 最大的cyl
变量,并将该变量添加到上述模型中。
步骤3:重复步骤2,将剩余变量中的每个变量添加到上面的模型中,再次找到'cyl'系数变化最大的那个;
步骤.....:重复直到所有变量都包含在模型中。
library(dplyr)
df <- mtcars %>% select(mpg, cyl, disp, hp, wt)
my_fun1 <- function(df=data) {
out_df <- data.frame(matrix(ncol = 0, nrow = (length(df) - 1)))
md_1 <- lm(mpg ~ cyl, data = df)
out_df$Models[1] <- "Crude"
out_df$Estimate[1] <- md_1$coefficients[2]
pre_change <- 0
to_rm <- 0
for (k in 2:(length(df)-1)) {
for (i in 3:length(df)) {
if (!i %in% to_rm) {
md_tmp <- update(md_1, . ~ . + df[[i]])
change <- abs(100*(md_tmp$coefficients[2] - md_1$coefficients[2])/md_1$coefficients[2])
dif <- md_tmp$coefficients[2] - md_1$coefficients[2]
if (change >= pre_change) {
out_df$Estimate[k] <- md_tmp$coefficients[2]
out_df$Models[k] <- paste("+", names(df)[[i]])
out_df$Diff[k] <- md_tmp$coefficients[2] - md_1$coefficients[2]
picked <- names(df)[[i]]
picked_i = i
pre_change <- out_df$`Change (%)`[k] <- change
}
}
}
to_rm <- c(to_rm, picked_i)
md_1 <- update(md_1, .~. + eval(as.name(paste(picked))))
pre_change = 0
}
out_df
}
my_fun1(df = df)
在上面运行之后,我希望以cyl
以下格式获得每一步的回归系数。
Models Estimate Diff Change (%)
1 Crude -2.875790 NA NA
2 + wt -1.507795 1.367995 47.56937
3 + hp -1.227420 0.280375 18.59504
4 + disp -1.227420 1.037274 45.80194
但是,步骤 1 和 2 提供了正确的结果,步骤 3 和 4 不正确。任何建议,将不胜感激。
解决方案
您可以通过使用 R 的矢量化属性并避免痛苦的for
循环来使这更容易一些。
my_fun2 <- function(dat, i) {
fit <- lm(mpg ~ cyl, data=dat)
crude <- fit$coef[2]
# vectorized evaluation function
# fits model and calculates coef and change
evav <- Vectorize(function(i) {
# create extension string from the "i"s
cf.ext <- paste(names(dat)[i], collapse="+")
# update formula with extensions
beta <- update(fit, as.formula(
paste0("mpg~cyl",
# paste already accepted coefs case they exist
if (length(bests) != 0) {
paste("", names(dat)[bests], sep="+", collapse="")
},
"+", cf.ext)
))$coe[2]
# calculate Diff
beta.d <- abs(crude - beta)
# calculate Change %
beta.d.perc <- 100 / crude*beta.d
# set an attribute "cf.name" to be able to identify coef later
return(`attr<-`(c(beta=beta, beta.d=beta.d,
beta.d.perc=beta.d.perc),
"cf.name", cf.ext))
}, SIMPLIFY=FALSE) # simplifying would strip off attributes
# create empty vector bests
bests <- c()
# lapply evav() over the "i"s
res <- lapply(i, function(...) {
# run evav()
i.res <- evav(i)
# find largest change
largest <- which.max(mapply(`[`, i.res, 2))
# update "bests" vector within function environment with `<<-`
bests <<- c(bests, i[largest])
# same with the "i"s
i <<- i[-largest]
return(i.res[[largest]])
})
# summarize everything into matrix and give dimnames
res <- `dimnames<-`(rbind(c(crude, NA, NA),
do.call(rbind, res)),
list(
c("crude",
paste0("+ ", mapply(attr, res, "cf.name"))),
c("Estimate", "Diff", "Change (%)")))
return(res)
}
用法
my_fun2(mtcars[c("mpg", "cyl", "disp", "hp", "wt")], i=3:5)
# Estimate Diff Change (%)
# crude -2.8757901 NA NA
# + wt -1.5077950 1.367995 -47.56937
# + hp -0.9416168 1.934173 -67.25711
# + disp -1.2933197 1.582470 -55.02733
查看
检查Diff
s:
fit <- lm(mpg ~ cyl, data=mtcars[c("mpg", "cyl", "disp", "hp", "wt")])
sapply(c("disp", "hp", "wt"), function(x)
unname(abs(fit$coe[2] - update(fit, as.formula(paste("mpg~cyl+", x)))$coe[2])))
# disp hp wt
# 1.2885133 0.6110965 1.3679952
sapply(c("disp", "hp"), function(x)
unname(abs(fit$coe[2] - update(fit, as.formula(paste("mpg~cyl+wt+", x)))$coe[2])))
# disp hp
# 1.090847 1.934173
sapply(c("disp"), function(x)
unname(abs(fit$coe[2] - update(fit, as.formula(paste("mpg~cyl+wt+hp+", x)))$coe[2])))
# disp
# 1.58247
应该看起来不错。
推荐阅读
- list - 如何动态地将数据添加到这样的列表中 [Dart]
- javascript - 如何在本地机器上运行 puppeteer 24/7
- swift - 在应用程序之间共享核心数据。应用程序组、iCloud
- javascript - 从另一个 Hook 访问 setState 方法
- javascript - 将 API 数据推送到 Chart.js
- python - python cuDF groupby 适用于有序数据
- r - 如何根据列表删除多个col值
- python - 如何使用属性文件或xml文件根据提供的参数显示数据
- json - 这是格式不正确的 JSON 吗?
- scala - 如何删除scala spark中的多个字符?