Why do different random forest implementations in R produce different results?

Question

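I admit this is a hard question to put to anyone other than the people who wrote these packages, but I am getting consistently different results from three different random forest implementations in R.

The three methods in question are the randomForest package, the "rf" method in caret, and the ranger package. The code is included below.

The data in question is one example; I see similar behavior in other specifications with similar data.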

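The LHS variable is party identification (Dem, Rep, Indep.). The right-hand-side predictors are demographic. To work out what was going on with some bizarre results from the randomForest package, I implemented the same model with the other two methods. They do not reproduce that particular anomaly, which is especially strange because, as far as I can tell, the rf method in caret is just an indirect use of the randomForest package.

The three specifications I run in each implementation are (1) three-category classification, (2) the same model with the Independent category removed, and (3) the same as 2, but with a single observation relabeled as "Independent" so that the model keeps three categories, which should produce results similar to 2. As far as I can tell, there should be no over- or undersampling in any case that would account for the results.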

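I also notice the following tendencies:

  1. The randomForest package is the only one that breaks down completely when there are only two categories.
  2. The ranger package consistently identifies more observations (both correctly and incorrectly) as Independent.
  3. The ranger package is always slightly worse in overall predictive accuracy.
  4. The caret package is similar to randomForest in overall accuracy (slightly higher), but consistently better on the more common class and worse on the less common class. This is strange because, as far as I can tell, I am not implementing any over- or undersampling in either case, and because I believe caret relies on the randomForest package.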

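Below I have included the code and the confusion matrices that show the differences in question. Re-running the code produces similar tendencies in the confusion matrices every time; this is not a case of "any individual run could produce weird results."

Does anyone know why these packages consistently produce slightly different results in general (and, in the case of the linked issue with randomForest, very different ones) or, better still, why they differ in this particular way? For example, is there some kind of sample weighting or stratification in these packages that I should be aware of?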

Code:

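library(randomForest)
library(ranger)
library(caret)

# Settings shared by every model below
num_trees <- 1001  # number of trees per forest
var_split <- 3     # mtry: variables tried at each split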

load("three_cat.Rda")
rf_three_cat  <-randomForest(party_id_3_cat~{RHS Vars},
                         data=three_cat,
                         ntree=num_trees,
                         mtry=var_split,
                         type="classification",
                         importance=TRUE,confusion=TRUE)

# Specification 2: drop the Independents, then drop the now-empty factor level
two_cat<-subset(three_cat,party_id_3_cat!="2. Independents")
two_cat$party_id_3_cat<-droplevels(two_cat$party_id_3_cat)
rf_two_cat    <-randomForest(party_id_3_cat~{RHS Vars},
                         data=two_cat,
                         ntree=num_trees,
                         mtry=var_split,
                         type="classification",
                         importance=TRUE,confusion=TRUE)
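
# Specification 3: same as 2, but relabel one observation as "2. Independents"
# so that the outcome keeps three levels (column 19 is presumably party_id_3_cat).
scramble_independent<-subset(three_cat,party_id_3_cat!="2. Independents")
scramble_independent[1,19]<-"2. Independents"
# Note: this coerces every column to a factor, including any numeric predictors.
scramble_independent<- data.frame(lapply(scramble_independent, as.factor), stringsAsFactors=TRUE)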
rf_scramble<-randomForest(party_id_3_cat~{RHS Vars},
                      data=scramble_independent,
                      ntree=num_trees,
                      mtry=var_split,
                      type="classification",
                      importance=TRUE,confusion=TRUE)

# The same three specifications fitted with ranger
ranger_2<-ranger(formula=party_id_3_cat~{RHS Vars},
             data=two_cat,
             num.trees=num_trees,mtry=var_split)
ranger_3<-ranger(formula=party_id_3_cat~{RHS Vars},
             data=three_cat,
             num.trees=num_trees,mtry=var_split)
ranger_scram<-ranger(formula=party_id_3_cat~{RHS Vars},
                 data=scramble_independent,
                 num.trees=num_trees,mtry=var_split)

# The same three specifications via caret's "rf" method, with resampling turned off
rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = c(3))
rf_caret_3        <- train(party_id_3_cat~{RHS Vars},
                      data=three_cat,
                      method="rf", ntree=num_trees,
                      type="classification",
                      importance=TRUE,confusion=TRUE,
                      trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2        <- train(party_id_3_cat~{RHS Vars},
                data = two_cat,
                method = "rf",ntree=num_trees,
                type="classification",
                importance=TRUE,confusion=TRUE,
                trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat~{RHS Vars},
                      data = scramble_independent,
                      method = "rf",ntree=num_trees,
                      type="classification",
                      importance=TRUE,confusion=TRUE,
                      trControl = rfControl, tuneGrid = rfGrid)

# Out-of-bag confusion matrices, grouped by specification
rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]

rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]

rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]

Results (format slightly modified for comparison):

> rf_three_cat$confusion
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1121               3                                697   0.3844042
2. Independents                                                   263               7                                261   0.9868173
3. Republicans (including leaners)                                509               9                               1096   0.3209418                        

> ranger_3$confusion.matrix
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1128              46                                647   0.3805601
2. Independents                                                 263              23                                245   0.9566855
3. Republicans (including leaners)                              572              31                               1011   0.3736059

> rf_caret_3$finalModel["confusion"]
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1268               0                                553   0.3036793
2. Independents                                                   304               0                                227   1.0000000
3. Republicans (including leaners)                                606               0                               1008   0.3754647

> rf_two_cat$confusion
                                     1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1775                                 46   0.0252608
3. Republicans (including leaners)                               1581                                 33   0.9795539

> ranger_2$confusion.matrix
                                   1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1154                                667   0.3662823
3. Republicans (including leaners)                              590                               1024   0.3655514

> rf_caret_2$finalModel["confusion"]
                                   1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1315                                  506   0.2778693
3. Republicans (including leaners)                              666                                  948   0.4126394

> rf_scramble$confusion
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1104               0                                717   0.3937397
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              501               0                               1112   0.3106014

> ranger_scram$confusion.matrix
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1159               0                               662  0.3635365
2. Independents                                                   0               0                                 1  1.0000000
3. Republicans (including leaners)                              577               0                              1036  0.3577185

> rf_caret_scramble$finalModel["confusion"]
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1315               0                                506   0.2778693
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              666               0                                947   0.4128952
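
The tendencies listed in the question can be checked directly against the reported matrices. The following is a minimal sketch in base R, with the counts transcribed from the three-class output above; the helper names (`make_cm`, `overall_accuracy`, `class_error`) and the short class labels are made up for illustration and appear in none of the packages.

```r
# Counts transcribed from the three-class confusion matrices above
# (rows = observed class, columns = predicted class).
classes <- c("Dem", "Ind", "Rep")

make_cm <- function(counts) {
  matrix(counts, nrow = 3, byrow = TRUE,
         dimnames = list(observed = classes, predicted = classes))
}

cm <- list(
  randomForest = make_cm(c(1121,  3,  697,
                            263,  7,  261,
                            509,  9, 1096)),
  ranger       = make_cm(c(1128, 46,  647,
                            263, 23,  245,
                            572, 31, 1011)),
  caret_rf     = make_cm(c(1268,  0,  553,
                            304,  0,  227,
                            606,  0, 1008))
)

# Share of observations on the diagonal
overall_accuracy <- function(m) sum(diag(m)) / sum(m)

# Per-class error, the same quantity as the class.error column
class_error <- function(m) 1 - diag(m) / rowSums(m)

acc <- sapply(cm, overall_accuracy)
err <- sapply(cm, class_error)

# How many observations each implementation assigns to "Ind"
pred_ind <- sapply(cm, function(m) sum(m[, "Ind"]))

print(round(acc, 3))
print(round(err, 3))
print(pred_ind)
```

With these counts, overall accuracy comes out at roughly 0.561 for randomForest, 0.545 for ranger, and 0.574 for caret, and the number of observations predicted as Independent is 19, 100, and 0 respectively, which matches the ordering described in the question.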

Tags: r, machine-learning, random-forest, r-caret

Answer


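First, the random forest algorithm is, well, random, so some variation between runs is expected by default. Second, and more importantly, the algorithms are different: the implementations do not follow the same steps, and that is why you get different results.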
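
To make the first point concrete, here is a minimal base-R sketch of the bootstrap draw that each tree starts from. It does not call any of the three packages; `draw_inbag` is a hypothetical helper, and `n` is taken from the row total of the three-class confusion matrices in the question.

```r
n <- 3966  # total of the question's three-class confusion matrices

# One bootstrap resample of row indices, drawn under a given seed
draw_inbag <- function(seed, n) {
  set.seed(seed)
  sample.int(n, size = n, replace = TRUE)
}

same_a <- draw_inbag(1, n)
same_b <- draw_inbag(1, n)
other  <- draw_inbag(2, n)

# Same seed: the resample is reproduced exactly
print(identical(same_a, same_b))

# Different seed: a different set of rows is in-bag
print(identical(same_a, other))

# Share of distinct rows that end up in-bag; close to 1 - exp(-1) for large n
inbag_fraction <- length(unique(same_a)) / n
print(inbag_fraction)
```

Fixing the seed before each fit removes this source of run-to-run variation within a single package. It does not make two different packages agree, because each one consumes random numbers in its own way.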

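Check how each implementation performs its splits (which criterion: Gini, extra trees, and so on) and whether the split points themselves are randomized (extremely randomized trees); how it draws the bootstrap samples (with or without replacement) and what fraction of the data each sample uses; how mtry, the number of variables tried at each split, is set; and what limits apply to tree size, such as the maximum depth or the maximum number of cases in a node.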
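
One way to act on this checklist is to set the relevant arguments explicitly in every package rather than relying on defaults. The sketch below is written under stated assumptions, not as a definitive recipe: the `toy` data frame and its variable names are invented, and the argument values are one possible alignment. Two points are worth checking in your own setup. As far as I know, caret's formula interface expands factor predictors into dummy columns before calling randomForest, so the same `mtry` refers to a different set of columns; the x/y interface avoids that. Also, ranger's default handling of unordered factors (`respect.unordered.factors = "ignore"`) differs from the category partitioning that randomForest does.

```r
set.seed(42)
n <- 300
region_levels <- c("east", "north", "south", "west")

# Invented toy data: two numeric predictors and one unordered factor
toy <- data.frame(
  age    = rnorm(n),
  income = rnorm(n),
  region = factor(sample(region_levels, n, replace = TRUE),
                  levels = region_levels)
)
toy$party <- factor(
  ifelse(toy$age + (toy$region == "south") + rnorm(n) > 0.3, "Dem", "Rep")
)

num_trees <- 501
var_split <- 2

# Number of columns after dummy expansion of the formula (intercept dropped)
n_original <- ncol(toy) - 1
n_expanded <- ncol(model.matrix(party ~ ., data = toy)) - 1
cat("original predictors:", n_original,
    " columns after dummy expansion:", n_expanded, "\n")

have <- function(pkg) requireNamespace(pkg, quietly = TRUE)

if (have("randomForest")) {
  fit_rf <- randomForest::randomForest(
    party ~ ., data = toy,
    ntree = num_trees, mtry = var_split,
    nodesize = 1, replace = TRUE
  )
  print(fit_rf$confusion)
}

if (have("ranger")) {
  fit_ranger <- ranger::ranger(
    party ~ ., data = toy,
    num.trees = num_trees, mtry = var_split,
    min.node.size = 1, replace = TRUE, sample.fraction = 1,
    splitrule = "gini",
    respect.unordered.factors = "partition"  # consider category partitions
  )
  print(fit_ranger$confusion.matrix)
}

if (have("caret") && have("randomForest")) {
  suppressPackageStartupMessages(library(caret))
  # x/y interface: factor predictors are passed through without dummy coding
  fit_caret <- train(
    x = toy[, c("age", "income", "region")], y = toy$party,
    method = "rf", ntree = num_trees,
    trControl = trainControl(method = "none"),
    tuneGrid = data.frame(mtry = var_split)
  )
  print(fit_caret$finalModel$confusion)
}
```

Even with aligned settings the three confusion matrices are not expected to be identical, since the implementations still differ in random number use and internal details. The comparison is only meant to show how much of the gap is due to defaults and data encoding.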

