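Selecting the class with the highest probability for each bootstrap sample

Question

I'm trying to write a for loop that builds bootstrap samples from the weather data in the rattle.data package (with RainTomorrow as the target column). For each bootstrap sample I want to pick the class with the highest probability, and then predict the class that gets the most votes.

With this code I keep getting a warning: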

if(!require(rpart)) install.packages("rpart") 
if(!require(rpart.plot)) install.packages("rpart.plot") 
if(!require(caret)) install.packages("caret") 
if(!require(rattle.data)) install.packages("rattle.data") 
if(!require(tidyverse)) install.packages("tidyverse") 
if(!require(ipred)) install.packages("ipred") 
if(!require(Metrics)) install.packages("Metrics") 
library(rpart)
library(rpart.plot)
library(rattle.data)
library(tidyverse)
library(caret)
library(ipred)
library(Metrics)

set.seed(500)

data <- weather

# creating train and test data
index <- createDataPartition(data$RainTomorrow, p = .6, list = FALSE)
train_data <- data[ index, ]
test_data <- data[-index, ]

## task b -> error in the for loop
nBoot = 10 # number of bootstrap samples

# create empty matrix [nr test data x nr bootstrap samples] to store bootstrap predictions
pred = matrix(data = NA, nrow = nrow(test_data), ncol = nBoot)

train_controls = rpart.control(minsplit = 6, maxdepth = 3)

for(b in 1:nBoot){
  #create bootstrap sample
  index.boot = sample(x=nrow(train_data), replace = T, size = nrow(train_data)) 
  data_boot = train_data[index.boot,]
  #fit data for the bootstrap sample
  boot.model  = rpart(RainTomorrow ~ ., 
                      data =data_boot, 
                      method = "anova", 
                      control = train_controls)
  #rpart.plot(boot.model)
  #save prediction for bootstrap
  pred[,b] = predict(boot.model, newdata= test_data )
}

#calculate prediction as mean of bootstrap predictions 

pred.bagged = rowMeans(pred)
print(rmse(actual = test_data$RainTomorrow, predicted = pred.bagged))

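Running it gives me this warning message:

In Ops.factor(actual, predicted) : '-' not meaningful for factors

And I can't for the life of me figure out why (I'm new to machine learning).

Edit: still looking for a working answer.

Tags: r, machine-learning, statistics-bootstrap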

Answer

The warning appears because you are trying to compute the RMSE against a factor:

pred.bagged = rowMeans(pred)
class(pred.bagged)
[1] "numeric"
class(test_data$RainTomorrow)
[1] "factor"

You can convert the factor to numeric, which is what rpart does when you specify method = "anova", and then compute the RMSE:

rmse(actual = as.numeric(test_data$RainTomorrow), predicted = pred.bagged)
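
To see what that conversion does, here is a minimal base-R sketch. The vectors are made-up toy values, not the weather data, and rmse_manual is just an illustrative name for the usual RMSE formula rather than the Metrics function:

```r
# Toy factor with the same levels as RainTomorrow (illustrative values only)
actual <- factor(c("No", "Yes", "No", "No"), levels = c("No", "Yes"))

# Arithmetic on a factor is undefined: R returns NA and raises the
# "'-' not meaningful for factors" warning (suppressed here)
diff_raw <- suppressWarnings(actual - 1.5)

# as.numeric() returns the internal integer codes: No -> 1, Yes -> 2
codes <- as.numeric(actual)

# Hypothetical averaged predictions on the same 1/2 scale
predicted <- c(1.1, 1.8, 1.0, 1.4)

# RMSE computed by hand
rmse_manual <- sqrt(mean((codes - predicted)^2))

print(diff_raw)
print(codes)
print(rmse_manual)
```

So the number you get is an RMSE on the 1/2 level codes, which is computable but not very informative for a yes/no target.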

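RMSE is normally used for regression and does not make much sense for a classification model. For classification you can use method = "class", and evaluate with accuracy, F1, or Cohen's kappa. The example below uses the confusion matrix from caret: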

for(b in 1:nBoot){
  #create bootstrap sample
  index.boot = sample(x=nrow(train_data), replace = T) 
  data_boot = train_data[index.boot,]
  #fit data for the bootstrap sample
  boot.model  = rpart(RainTomorrow ~ ., 
                      data =data_boot, 
                      method = "class", 
                      control = train_controls)
  #rpart.plot(boot.model)
  #save prediction for bootstrap
  pred[,b] = as.character(predict(boot.model, newdata= test_data ,type="class"))
}
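
If you want the "class with the highest probability" step to be explicit, one option is to ask the tree for probabilities with predict(boot.model, newdata = test_data, type = "prob"), which for an rpart classification tree returns one column per class, and then take the column with the largest value in each row. The sketch below shows only that last step on a hypothetical probability matrix, so the numbers are invented:

```r
# Hypothetical class probabilities for 4 test rows from one bootstrap model
prob <- matrix(c(0.9, 0.1,
                 0.3, 0.7,
                 0.5, 0.5,
                 0.2, 0.8),
               ncol = 2, byrow = TRUE,
               dimnames = list(NULL, c("No", "Yes")))

# Column index of the largest probability in each row;
# ties go to the first column ("No" here)
best <- max.col(prob, ties.method = "first")

# Map the indices back to class labels
pred_class <- colnames(prob)[best]
print(pred_class)
```

For a two-class tree this should give the same labels as type = "class", apart from how exact ties are handled.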

# very crude way to get majority vote
pred.bagged = apply(pred,1,function(i){
names(sort(table(factor(i,levels=c("No","Yes")))))[2]
})
# convert to a factor, same levels as test_data$RainTomorrow
pred.bagged = factor(pred.bagged,levels=c("No","Yes"))
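
A slightly less crude way to get the vote for a two-class problem is to compute the share of models that said "Yes" for each row. This is a sketch on a toy matrix; votes, yes_share and bagged are illustrative names, and the 0.5 threshold is an assumption that sends exact ties to "No":

```r
# Toy prediction matrix: 3 test rows x 5 bootstrap models (made-up votes)
votes <- matrix(c("No",  "No",  "Yes", "No",  "No",
                  "Yes", "Yes", "No",  "Yes", "Yes",
                  "No",  "Yes", "Yes", "No",  "Yes"),
                nrow = 3, byrow = TRUE)

# Fraction of models voting "Yes" for each test row
yes_share <- rowMeans(votes == "Yes")

# Majority vote: "Yes" only when more than half of the models say so
bagged <- factor(ifelse(yes_share > 0.5, "Yes", "No"),
                 levels = c("No", "Yes"))
print(bagged)
```

With an even nBoot such as 10, ties can happen, so an odd number of bootstrap samples avoids having to pick a tie rule at all.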

confusionMatrix(pred.bagged, test_data$RainTomorrow)
Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  120   0
       Yes   0  26

               Accuracy : 1          
                 95% CI : (0.9751, 1)
    No Information Rate : 0.8219     
    P-Value [Acc > NIR] : 3.672e-13  

                  Kappa : 1          

 Mcnemar's Test P-Value : NA         

            Sensitivity : 1.0000     
            Specificity : 1.0000     
         Pos Pred Value : 1.0000     
         Neg Pred Value : 1.0000     
             Prevalence : 0.8219     
         Detection Rate : 0.8219     
   Detection Prevalence : 0.8219     
      Balanced Accuracy : 1.0000     

       'Positive' Class : No         
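
The headline numbers in that output can be checked by hand from the table. This sketch rebuilds the 2 x 2 table from the counts printed above and recomputes the accuracy and the no-information rate in base R (cm, accuracy and nir are illustrative names):

```r
# Confusion table rebuilt from the counts in the output above
cm <- matrix(c(120, 0,
               0,  26),
             nrow = 2, byrow = TRUE,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))

# Accuracy: correctly classified rows (the diagonal) over all rows
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the most frequent reference class
nir <- max(colSums(cm)) / sum(cm)

print(accuracy)
print(round(nir, 4))
```

A perfect score on held-out data is unusual. As far as I know the weather data contains a RISK_MM column that records the next day's rainfall, so it may be worth checking whether dropping it from the formula changes the result.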
