Error when computing a confusion matrix or contingency table for multiclass classification with ranger

Problem description

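I am using ranger to model a multiclass problem on a large mixed-type data frame (some of the categorical variables have more than 53 levels). Training and prediction run without any problems. However, computing the confusion matrix / contingency table fails.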

I use the iris data to illustrate the difficulty, treating Species as the categorical response:

library(ranger)
library(caret)

# Data
idx = sample(nrow(iris),100)
data = iris

# Split data sets
Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE, probability=TRUE)

# Test
probabilitiesSpecies <- predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)
# or
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

I run into the following problems:

table(Test_Set$Species, probabilitiesSpecies$predictions)

Error in table(Test_Set$Species, probabilitiesSpecies$predictions) : 
all arguments must have the same length

or

caret::confusionMatrix(Test_Set$Species, probabilitiesSpecies$predictions)
or
caret::confusionMatrix(table(Test_Set$Species, max.col(probabilitiesSpecies)-1))
gives
Error: `data` and `reference` should be factors with the same levels.

However, the binary classification shown below works:

idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="virginica",1,0))

Train_Set = data[idx,]
Test_Set = data[-idx,]

# Train
Species.ranger <- ranger(Species ~ ., data=Train_Set, importance="impurity", save.memory = TRUE, probability=TRUE)

# Test
probabilitiesSpecies <- as.data.frame(predict(Species.ranger, data = Test_Set,type='response', verbose = TRUE)$predictions)

caret::confusionMatrix(table(max.col(probabilitiesSpecies)-1, Test_Set$Species))

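How can I fix this so that I get a confusion matrix for the multiclass case? I have also posted this as a separate thread ("Error when computing a multiclass confusion matrix with ranger").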

Tags: r, machine-learning, classification, confusion-matrix, r-ranger

Solution


From the ranger documentation, when probability = TRUE:

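With the probability option and factor dependent variable a probability forest is grown. Here, the node impurity is used for splitting, as in classification forests. Predictions are class probabilities for each sample. In contrast to other implementations, each tree returns a probability estimate and these estimates are averaged for the forest probability estimate. For details see Malley et al. (2012).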
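The shape of these predictions likely explains the first error in the question. A minimal sketch, assuming the same iris split as in the question (the seed and the names `fit` and `prob` are only illustrative): with probability = TRUE the `predictions` element is a matrix with one row per test sample and one column per class, so its length no longer matches the vector of true labels that `table()` expects.

```r
library(ranger)

set.seed(1)  # arbitrary seed, only so the split is reproducible
idx <- sample(nrow(iris), 100)
Train_Set <- iris[idx, ]
Test_Set <- iris[-idx, ]

# Probability forest: one probability per class for each test sample
fit <- ranger(Species ~ ., data = Train_Set, probability = TRUE)
prob <- predict(fit, data = Test_Set)$predictions

dim(prob)                 # 50 rows (test samples) by 3 columns (classes)
length(prob)              # 150 elements in the matrix
length(Test_Set$Species)  # 50 true labels
```

Since `table()` compares argument lengths, passing the 150-element matrix next to the 50 labels produces the "all arguments must have the same length" error.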

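So when probability is set to TRUE, you get probability estimates, which you can then turn into classes according to your own threshold. However, I do not know the default decision rule when it is set to FALSE.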
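If you want to keep probability = TRUE, one possible decision rule is to pick the most probable class for each row. The following is a sketch under that assumption, not the only sensible rule; the seed and the names `prob`, `predicted` and `cm` are illustrative. It maps the winning column back to its class label through the column names of the prediction matrix and builds a factor with the same levels as the reference, which is what `caret::confusionMatrix` requires.

```r
library(ranger)
library(caret)

set.seed(42)  # arbitrary seed, only so the split is reproducible
idx <- sample(nrow(iris), 100)
Train_Set <- iris[idx, ]
Test_Set <- iris[-idx, ]

# Probability forest: predictions are a matrix with one column per class
Species.ranger <- ranger(Species ~ ., data = Train_Set, probability = TRUE)
prob <- predict(Species.ranger, data = Test_Set)$predictions

# Column index of the highest probability in each row, mapped to its class
# label; the factor reuses the levels of the true labels
predicted <- factor(
  colnames(prob)[max.col(prob, ties.method = "first")],
  levels = levels(Test_Set$Species)
)

# Predictions first, reference second
cm <- caret::confusionMatrix(data = predicted, reference = Test_Set$Species)
print(cm$table)
```

The attempt in the question, `max.col(probabilitiesSpecies) - 1`, yields the numbers 0, 1 and 2. These happen to match the 0/1 coding of the binary example, but they do not match the level names setosa, versicolor and virginica, hence the "same levels" error.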

In any case, your approach should be as follows:

Species.ranger <- ranger(
        Species ~ .,
        data = Train_Set,
        importance ="impurity",
        save.memory = TRUE, 
        probability = FALSE
)

Its performance can then be evaluated with confusionMatrix as follows:

probabilitiesSpecies <- predict(
        Species.ranger,
        data = Test_Set,
        verbose = TRUE
        )

library(magrittr)  # provides the %>% pipe used below

table(
        probabilitiesSpecies$predictions,
        Test_Set$Species
) %>% confusionMatrix()

Output

Confusion Matrix and Statistics

            
             setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         16         1
  virginica       0          0        16

Overall Statistics
                                          
               Accuracy : 0.98            
                 95% CI : (0.8935, 0.9995)
    No Information Rate : 0.34            
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.97            
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                   1.00            1.0000           0.9412
Specificity                   1.00            0.9706           1.0000
Pos Pred Value                1.00            0.9412           1.0000
Neg Pred Value                1.00            1.0000           0.9706
Prevalence                    0.34            0.3200           0.3400
Detection Rate                0.34            0.3200           0.3200
Detection Prevalence          0.34            0.3400           0.3200
Balanced Accuracy             1.00            0.9853           0.9706
