r - How to iteratively remove values so that a regression in R has no negative predictions
Problem description
Here is my data
data=structure(list(session_id = c(13532925L, 13532921L, 13532918L,
13532917L, 13532912L, 13532910L, 13532909L, 13532908L, 13532907L,
13532900L), weekday_session = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L), hour_session = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L), session_price = c(18630L, 5410L, 20790L, 5410L, 7780L, 16590L,
5410L, 9870L, 4190L, 13770L), flight_type = c(1L, 1L, 1L, 1L,
2L, 1L, 1L, 1L, 1L, 1L), airline_id = c(156L, 156L, 156L, 156L,
238L, 156L, 156L, 238L, 238L, 156L), meta_flight_type = c(1L,
1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), days_to_flight = c(15L,
1L, 31L, 1L, 19L, 3L, 0L, 9L, 41L, 3L), flight_area = structure(c(3L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("RU", "w", "W"
), class = "factor"), search_category_id = structure(c(2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("13", "other"), class = "factor"),
count_passengers = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
1L), is_children_session = c(0L, 0L, 0L, 0L, 0L, 1L, 1L,
1L, 1L, 0L), is_infant_session = c(0L, 0L, 0L, 0L, 1L, 0L,
1L, 0L, 0L, 0L), trip_duration_min = c(525L, 120L, 745L,
120L, 280L, 725L, 120L, 140L, 220L, 460L), flight_duration_min = c(400L,
120L, 440L, 120L, 280L, 570L, 120L, 140L, 220L, 460L), stopovers = c(1L,
0L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), prediction = c(0.0556,
0.3479, 0.0646, 0.3479, 0.1514, 0.1906, 0.293, 0.3693, 0.1871,
0.1319)), class = "data.frame", row.names = c(NA, -10L))
Here is my code
mydat<- read.csv("data.csv", sep=";",dec=",")
View(mydat)
str(mydat)
mydat$session_id<-NULL
# split the sample into train and test sets
index <- sample(1:nrow(mydat),round(0.70*nrow(mydat)))
train <- mydat[index,]
test <- mydat[-index,]
#build the model
mymodel=lm(prediction~.,data=mydat)
summary(mymodel)
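As a result I get negative values, for example -0.023. But there must be no negative values, not even zero.
How can I make R iterate, removing values from different columns, until no predicted value is below 0.01? In other words, it needs to choose values for the beta coefficients such that the equation never yields a result less than or equal to zero. Is there a way to do this?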
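Solution
TL;DR - These negative coefficients are not a flaw in your model; they are a feature of your data. Many of your variables are negatively correlated with your prediction. Given that most of the data is negatively correlated with your prediction, can you explain why you expect or require the model coefficients to be positive?
It sounds like you want to remove rows from your data until the coefficient estimates are all positive. Are you sure you want to do that? Negative coefficients do not necessarily mean that your model will return predicted values below zero. Before you do this, you should probably visualize your data and work out what is going on.
Here is an example: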
library(ggplot2)
ggplot(mydat, aes(x = trip_duration_min, y = prediction)) +
geom_point() +
geom_smooth(method = "lm")
There is a strong negative correlation between your label, prediction, and the trip_duration_min variable.
cor(mydat$trip_duration_min, mydat$prediction)
[1] -0.7856956
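To make this correlation positive, you would need to remove every row of the sample dataset with trip_duration_min <= 400 (more than half the data), keeping only the rows where trip_duration_min > 400. You can do that, but the resulting correlation and the overall model are weaker. Your predictions will not be very good.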
cor(mydat$trip_duration_min[which(mydat$trip_duration_min > 400)],
mydat$prediction[which(mydat$trip_duration_min > 400)])
[1] 0.1650505
Moreover, if you do remove those rows, there is no guarantee that doing so will give the result you want for the other coefficients:
library(dplyr)
mymodel = lm(prediction~., data = mydat %>% filter(trip_duration_min > 400))
summary(mymodel)
Call:
lm(formula = prediction ~ ., data = mydat %>% filter(trip_duration_min >
400))
Residuals:
ALL 4 residuals are 0: no residual degrees of freedom!
Coefficients: (12 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.319e-01 NA NA NA
weekday_session NA NA NA NA
hour_session NA NA NA NA
session_price 2.082e-05 NA NA NA
flight_type NA NA NA NA
airline_id NA NA NA NA
meta_flight_type -8.600e-02 NA NA NA
days_to_flight -7.622e-03 NA NA NA
flight_areaW NA NA NA NA
search_category_idother NA NA NA NA
count_passengers NA NA NA NA
is_children_session NA NA NA NA
is_infant_session NA NA NA NA
trip_duration_min NA NA NA NA
flight_duration_min NA NA NA NA
stopovers NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 3 and 0 DF, p-value: NA
NA
Because there are so few rows, this model produces a lot of NA values, yet most of the remaining coefficients are still negative.
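Note that these coefficients do not mean your predicted values are negative. If you call predict on your original model, the one fit to all ten rows, you will see that it produces results greater than zero. The fit is perfect, but that is only because the sample data in your question has fewer observations than variables.
# refit on the full sample; mymodel was overwritten by the filtered fit above
predict(lm(prediction ~ ., data = mydat))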
1 2 3 4 5 6 7 8 9 10
0.0556 0.3479 0.0646 0.3479 0.1514 0.1906 0.2930 0.3693 0.1871 0.1319
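To see the same point without the perfect-fit problem, here is a minimal sketch that uses only two columns of the sample data. The data frame name `toy` and the `threshold` of 0.01 (the lower bound mentioned in the question) are assumptions made for illustration. It fits a one-predictor model, confirms that the slope is negative, and then counts how many fitted values fall at or below the threshold.

```r
# Two columns copied from the question's sample data (illustration only)
toy <- data.frame(
  trip_duration_min = c(525, 120, 745, 120, 280, 725, 120, 140, 220, 460),
  prediction = c(0.0556, 0.3479, 0.0646, 0.3479, 0.1514,
                 0.1906, 0.2930, 0.3693, 0.1871, 0.1319)
)

fit <- lm(prediction ~ trip_duration_min, data = toy)

# The slope is negative because the two variables are negatively correlated
slope <- coef(fit)[["trip_duration_min"]]

# What matters is the fitted values, not the sign of the coefficients
fitted_vals <- predict(fit)
threshold <- 0.01                      # hypothetical bound from the question
n_too_small <- sum(fitted_vals <= threshold)

print(slope)
print(range(fitted_vals))
print(n_too_small)
```

Checking the predictions directly like this tells you whether there is a problem to solve at all, before any rows or columns are dropped.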
Here is a full overview of each variable's correlation coefficient with prediction:
library(dplyr)
library(tidyr)
library(tibble)  # provides rownames_to_column()
mydat %>%
mutate(across(everything(), as.numeric)) %>%
cor %>%
as.data.frame() %>%
rownames_to_column(var = "var1") %>%
pivot_longer(cols = -var1, names_to = "var2", values_to = "cor") %>%
filter(var1 != var2,
!is.na(cor),
var1 == "prediction") %>%
arrange(cor)
# A tibble: 13 x 3
var1 var2 cor
<chr> <chr> <dbl>
1 prediction trip_duration_min -0.786
2 prediction flight_duration_min -0.781
3 prediction session_price -0.724
4 prediction stopovers -0.646
5 prediction days_to_flight -0.526
6 prediction search_category_id -0.271
7 prediction flight_type -0.186
8 prediction flight_area -0.111
9 prediction is_infant_session 0.0369
10 prediction airline_id 0.129
11 prediction count_passengers 0.287
12 prediction is_children_session 0.336
13 prediction meta_flight_type 0.344
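Only five variables are positively correlated with prediction. You could build a model using just those five. You would still get negative coefficients in the model.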
model_2 <- lm(prediction ~ is_infant_session + airline_id +
count_passengers + is_children_session +
meta_flight_type, data = mydat)
summary(model_2)
Call:
lm(formula = prediction ~ is_infant_session + airline_id + count_passengers +
is_children_session + meta_flight_type, data = mydat)
Residuals:
1 2 3 4 5 6 7 8 9 10
-1.949e-01 9.743e-02 -3.365e-02 9.743e-02 8.327e-17 4.163e-17 -1.388e-16 9.110e-02 -9.110e-02 3.365e-02
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.082271 0.346762 -0.237 0.824
is_infant_session -0.049817 0.231749 -0.215 0.840
airline_id 0.001256 0.003878 0.324 0.762
count_passengers -0.015367 0.409196 -0.038 0.972
is_children_session 0.092350 0.168622 0.548 0.613
meta_flight_type 0.152217 0.125684 1.211 0.293
Residual standard error: 0.1377 on 4 degrees of freedom
Multiple R-squared: 0.3961, Adjusted R-squared: -0.3587
F-statistic: 0.5248 on 5 and 4 DF, p-value: 0.7522
However, when you predict from this model, you still get results that are all positive.
predict(model_2)
1 2 3 4 5 6 7 8 9 10
0.2504667 0.2504667 0.0982500 0.2504667 0.1514000 0.1906000 0.2930000 0.2782000 0.2782000 0.0982500
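If strictly positive predictions are a hard requirement even for new, unseen rows, one common option, which is not part of the answer above and is offered only as a sketch, is to model the logarithm of the response and back-transform with exp(). The exponential of any finite number is positive, so the back-transformed predictions cannot be zero or negative. The names `toy`, `log_fit` and `new_rows`, the choice of two predictors, and the values in `new_rows` are all assumptions made for illustration.

```r
# Three columns copied from the question's sample data (illustration only)
toy <- data.frame(
  trip_duration_min = c(525, 120, 745, 120, 280, 725, 120, 140, 220, 460),
  meta_flight_type  = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0),
  prediction = c(0.0556, 0.3479, 0.0646, 0.3479, 0.1514,
                 0.1906, 0.2930, 0.3693, 0.1871, 0.1319)
)

# Fit on the log scale; this requires every observed response to be > 0
log_fit <- lm(log(prediction) ~ trip_duration_min + meta_flight_type,
              data = toy)

# Hypothetical new rows, including durations outside the observed range
new_rows <- data.frame(
  trip_duration_min = c(100, 1000, 2000),
  meta_flight_type  = c(0, 1, 0)
)

# Back-transform to the original scale; exp() of a finite value is positive
pred <- exp(predict(log_fit, newdata = new_rows))
print(pred)
```

This guarantees positivity but not the 0.01 floor from the question, and back-transformed values from a log-scale fit are generally not unbiased estimates of the mean, so treat it as a starting point rather than a finished model.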