How to iteratively remove values so that a regression in R gives no negative predictions

Problem description

Here is my data

data=structure(list(session_id = c(13532925L, 13532921L, 13532918L, 
13532917L, 13532912L, 13532910L, 13532909L, 13532908L, 13532907L, 
13532900L), weekday_session = c(7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L), hour_session = c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L), session_price = c(18630L, 5410L, 20790L, 5410L, 7780L, 16590L, 
5410L, 9870L, 4190L, 13770L), flight_type = c(1L, 1L, 1L, 1L, 
2L, 1L, 1L, 1L, 1L, 1L), airline_id = c(156L, 156L, 156L, 156L, 
238L, 156L, 156L, 238L, 238L, 156L), meta_flight_type = c(1L, 
1L, 0L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), days_to_flight = c(15L, 
1L, 31L, 1L, 19L, 3L, 0L, 9L, 41L, 3L), flight_area = structure(c(3L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L), .Label = c("RU", "w", "W"
), class = "factor"), search_category_id = structure(c(2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("13", "other"), class = "factor"), 
    count_passengers = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
    1L), is_children_session = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 
    1L, 1L, 0L), is_infant_session = c(0L, 0L, 0L, 0L, 1L, 0L, 
    1L, 0L, 0L, 0L), trip_duration_min = c(525L, 120L, 745L, 
    120L, 280L, 725L, 120L, 140L, 220L, 460L), flight_duration_min = c(400L, 
    120L, 440L, 120L, 280L, 570L, 120L, 140L, 220L, 460L), stopovers = c(1L, 
    0L, 2L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), prediction = c(0.0556, 
    0.3479, 0.0646, 0.3479, 0.1514, 0.1906, 0.293, 0.3693, 0.1871, 
    0.1319)), class = "data.frame", row.names = c(NA, -10L))

Here is my code

mydat<- read.csv("data.csv", sep=";",dec=",")
View(mydat)
str(mydat)
mydat$session_id<-NULL



#split the sample into train and test sets
index <- sample(1:nrow(mydat),round(0.70*nrow(mydat)))
train <- mydat[index,]
test <- mydat[-index,]

#build the model
mymodel=lm(prediction~.,data=mydat)
summary(mymodel)

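As a result I get negative values, for example -0,023. But there must be no negative values, and not even zero.

How can I make R iterate and remove values from different columns until no predicted value is smaller than 0.01? In other words, it needs to pick beta coefficients such that the equation never yields a result less than or equal to zero. Is there a way to do this?

Tags: r, dplyr, data.table

Solution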


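TL;DR - These negative coefficients are not a flaw in your model; they are a feature of your data. Many of your variables are negatively correlated with your prediction. Given that most of the data is negatively correlated with your prediction, can you explain why you expect or require the model coefficients to be positive?

It sounds like you want to remove rows from your data until the coefficient estimates are all positive. Are you sure you want to do that? Negative coefficients do not necessarily mean that your model will return predicted values below zero. Before you do this, you should probably visualize your data and find out what is going on.

Here is an example: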

library(ggplot2)
ggplot(mydat, aes(x = trip_duration_min, y = prediction)) +
  geom_point() +
  geom_smooth(method = "lm")

[Figure: scatter plot of prediction against trip_duration_min with a fitted linear regression line]

There is a strong negative correlation between your label, prediction, and the trip_duration_min variable.

cor(mydat$trip_duration_min, mydat$prediction)
[1] -0.7856956
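
To make the point about negative coefficients concrete, here is a minimal sketch that fits a one-predictor model on two columns copied from the question's example data. The names toy and toy_fit are hypothetical, and the single-predictor model is a simplification for illustration, not the model from the question.

```r
# Two columns copied from the question's example data
toy <- data.frame(
  trip_duration_min = c(525, 120, 745, 120, 280, 725, 120, 140, 220, 460),
  prediction = c(0.0556, 0.3479, 0.0646, 0.3479, 0.1514,
                 0.1906, 0.2930, 0.3693, 0.1871, 0.1319)
)

# One-predictor linear model (a simplification of the question's full model)
toy_fit <- lm(prediction ~ trip_duration_min, data = toy)

slope <- unname(coef(toy_fit)["trip_duration_min"])
fitted_vals <- fitted(toy_fit)

slope < 0                # is the coefficient negative?
range(fitted_vals)       # smallest and largest fitted value
any(fitted_vals < 0.01)  # does any fitted value fall below the 0.01 threshold?
```

The slope is negative, yet the positive intercept keeps every fitted value above zero over the observed range of trip durations. Extrapolating far beyond that range could still push a linear prediction below zero.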

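To make this correlation positive, you would need to remove every row of the example dataset where trip_duration_min <= 400 (6 of the 10 rows, more than half of the data), keeping only the rows where trip_duration_min > 400. You can do that, but the resulting correlation and the overall model are weaker. Your predictions will not be very good.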

cor(mydat$trip_duration_min[which(mydat$trip_duration_min > 400)], 
    mydat$prediction[which(mydat$trip_duration_min > 400)])
[1] 0.1650505

Furthermore, if you remove those rows, there is no guarantee that doing so will have the desired effect on the other coefficients:

library(dplyr)
# use a separate name so the model fitted on all rows (mydat) is not overwritten
model_filtered = lm(prediction~., data = mydat %>% filter(trip_duration_min > 400))
summary(model_filtered)
Call:
lm(formula = prediction ~ ., data = mydat %>% filter(trip_duration_min > 
    400))

Residuals:
ALL 4 residuals are 0: no residual degrees of freedom!

Coefficients: (12 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -1.319e-01         NA      NA       NA
weekday_session                 NA         NA      NA       NA
hour_session                    NA         NA      NA       NA
session_price            2.082e-05         NA      NA       NA
flight_type                     NA         NA      NA       NA
airline_id                      NA         NA      NA       NA
meta_flight_type        -8.600e-02         NA      NA       NA
days_to_flight          -7.622e-03         NA      NA       NA
flight_areaW                    NA         NA      NA       NA
search_category_idother         NA         NA      NA       NA
count_passengers                NA         NA      NA       NA
is_children_session             NA         NA      NA       NA
is_infant_session               NA         NA      NA       NA
trip_duration_min               NA         NA      NA       NA
flight_duration_min             NA         NA      NA       NA
stopovers                       NA         NA      NA       NA

Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:    NaN 
F-statistic:   NaN on 3 and 0 DF,  p-value: NA

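This model produces many NA values because there are too few rows, and several of the remaining coefficients (the intercept, meta_flight_type, and days_to_flight) are still negative.

Note that these coefficients do not mean your predicted values are negative. If you call predict on your original model (the one fitted on all ten rows), you will see that it produces results greater than zero. The fit is perfect, but only because the example data in your question has fewer observations than variables.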

predict(mymodel)
     1      2      3      4      5      6      7      8      9     10 
0.0556 0.3479 0.0646 0.3479 0.1514 0.1906 0.2930 0.3693 0.1871 0.1319 
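
The perfect fit is a general property of having as many estimated coefficients as observations, not something specific to this dataset. A small sketch with made-up random data illustrates it; the data frame sat, its columns, and the seed are all hypothetical.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical toy data: 4 observations, 3 predictors plus an intercept,
# so the model has as many coefficients as there are rows
n <- 4
sat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), y = runif(n))

sat_fit <- lm(y ~ x1 + x2 + x3, data = sat)

df.residual(sat_fit)          # residual degrees of freedom
max(abs(residuals(sat_fit)))  # largest absolute residual
```

With no residual degrees of freedom the fitted values reproduce y exactly (up to floating-point error), which says nothing about how the model would behave on new data.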

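Here is a complete overview of each variable's correlation coefficient with prediction:

library(dplyr)
library(tidyr)
library(tibble)  # provides rownames_to_column()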
mydat %>%
  mutate(across(everything(), as.numeric)) %>%
  cor %>%
  as.data.frame() %>%
  rownames_to_column(var = "var1") %>%
  pivot_longer(cols = -var1, names_to = "var2", values_to = "cor") %>%
  filter(var1 != var2,
         !is.na(cor),
         var1 == "prediction") %>%
  arrange(cor)
# A tibble: 13 x 3
   var1       var2                    cor
   <chr>      <chr>                 <dbl>
 1 prediction trip_duration_min   -0.786 
 2 prediction flight_duration_min -0.781 
 3 prediction session_price       -0.724 
 4 prediction stopovers           -0.646 
 5 prediction days_to_flight      -0.526 
 6 prediction search_category_id  -0.271 
 7 prediction flight_type         -0.186 
 8 prediction flight_area         -0.111 
 9 prediction is_infant_session    0.0369
10 prediction airline_id           0.129 
11 prediction count_passengers     0.287 
12 prediction is_children_session  0.336 
13 prediction meta_flight_type     0.344 

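Only five variables are positively correlated with prediction. You could build a model using just those five. You would still get negative coefficients in the model.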

model_2 <- lm(prediction ~ is_infant_session + airline_id + 
              count_passengers + is_children_session + 
              meta_flight_type, data = mydat)
summary(model_2)
Call:
lm(formula = prediction ~ is_infant_session + airline_id + count_passengers + 
    is_children_session + meta_flight_type, data = mydat)

Residuals:
         1          2          3          4          5          6          7          8          9         10 
-1.949e-01  9.743e-02 -3.365e-02  9.743e-02  8.327e-17  4.163e-17 -1.388e-16  9.110e-02 -9.110e-02  3.365e-02 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         -0.082271   0.346762  -0.237    0.824
is_infant_session   -0.049817   0.231749  -0.215    0.840
airline_id           0.001256   0.003878   0.324    0.762
count_passengers    -0.015367   0.409196  -0.038    0.972
is_children_session  0.092350   0.168622   0.548    0.613
meta_flight_type     0.152217   0.125684   1.211    0.293

Residual standard error: 0.1377 on 4 degrees of freedom
Multiple R-squared:  0.3961,    Adjusted R-squared:  -0.3587 
F-statistic: 0.5248 on 5 and 4 DF,  p-value: 0.7522

However, when you predict with this model, you still get positive values.

predict(model_2)
        1         2         3         4         5         6         7         8         9        10 
0.2504667 0.2504667 0.0982500 0.2504667 0.1514000 0.1906000 0.2930000 0.2782000 0.2782000 0.0982500 
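
If predictions must be strictly positive by construction, even for inputs outside the observed range, one option that is not part of the answer above is to fit the linear model on log(prediction) and back-transform with exp(). The sketch below assumes that approach suits the data; toy, log_fit, and new_trips are hypothetical names, and only two columns of the question's data are used.

```r
# Two columns copied from the question's example data
toy <- data.frame(
  trip_duration_min = c(525, 120, 745, 120, 280, 725, 120, 140, 220, 460),
  prediction = c(0.0556, 0.3479, 0.0646, 0.3479, 0.1514,
                 0.1906, 0.2930, 0.3693, 0.1871, 0.1319)
)

# Model the log of the response; this requires every observed value to be > 0
log_fit <- lm(log(prediction) ~ trip_duration_min, data = toy)

# Hypothetical new inputs, including one far outside the observed range
new_trips <- data.frame(trip_duration_min = c(100, 500, 2000))

# exp() maps any real number to a positive one, so these cannot be <= 0
pred_pos <- exp(predict(log_fit, newdata = new_trips))
pred_pos
```

This guarantees positivity but not the 0.01 floor from the question: a back-transformed prediction can still be arbitrarily close to zero, so that threshold would need a separate check.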
