R caret: "non-numeric argument to binary operator" in train with qrf

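Question

When I run a quantile regression forest model with caret::train, I get the following error: Error in { : task 1 failed - "non-numeric argument to binary operator".

When I set ntree to a higher number (in my reproducible example, ntree = 150), my code runs without error.

This code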

library(caret)
library(quantregForest)

data(segmentationData)

dat <- segmentationData[segmentationData$Case == "Train",]
dat <- dat[1:50,]

# predictors
preds <- dat[,c(5:ncol(dat))]

# convert all to numeric
preds <- data.frame(sapply(preds, function(x) as.numeric(as.character(x))))

# response variable
response <- dat[,4]

# set up error measures
sumfct <- function(data, lev = NULL, model = NULL){
  RMSE <- sqrt(mean((data$pred - data$obs)^2, na.rm = TRUE))
  c(RMSE = RMSE)
}


# specify folds
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
folds_train <- caret::createMultiFolds(y = dat$Cell,
                                       k = 10,
                                       times = 5)

# specify trainControl for tuning mtry with the created multifolds
finalcontrol <- caret::trainControl(search = "grid", method = "repeatedcv", number = 10, repeats = 5, 
                                    index = folds_train, savePredictions = TRUE, summaryFunction = sumfct)

# build grid for tuning mtry
tunegrid <- expand.grid(mtry = c(2, 10, sqrt(ncol(preds)), ncol(preds)/3))

# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds, 
                      y = response,
                      method ="qrf",
                      ntree = 30, # with ntree = 150 it works
                      metric = "RMSE",
                      tuneGrid = tunegrid,
                      trControl = finalcontrol,
                      importance = TRUE,
                      keep.inbag = TRUE
)

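produces the error. With my real data, I have already run the model with ntree = 10000 and the task still fails. How can I solve this problem?

Where in caret's source code can I find the condition that raises Error in { : task 1 failed - "non-numeric argument to binary operator"? Which part of the source code does the error message come from?

Update: Following StupidWolf's answer, I adjusted my code for my real data, so it now looks like this: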

# train model
set.seed(42, kind = "Mersenne-Twister", normal.kind = "Inversion")
model <- caret::train(x = preds, 
                      y = response,
                      method ="qrf",
                      ntree = 30, # with ntree = 150 it works
                      metric = "RMSE",
                      sampsize = ceiling(length(response)*0.4),
                      tuneGrid = tunegrid,
                      trControl = finalcontrol,
                      importance = TRUE,
                      keep.inbag = FALSE
)

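With my real data I still get the error message above, so in the worst case I have to reduce the sample size to 0.1*length(response) before the model computes successfully. So setting keep.inbag = FALSE alone still produces the error. I have up to 1500 predictors, while the number of samples (rows) is only 50 to 60. I still don't understand what exactly causes the error message. I tried the model without the sampsize argument, always with keep.inbag = FALSE, and the error still occurs. Only setting the sample size very low ensures success.

How can I run the model successfully without setting sampsize? What I actually want is a bootstrap of my own data set, not an artificial 40% or 10% subsample of the data set used to train the forest.

Tags: r, random-forest, r-caret

Answer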


You get the error because you use the option keep.inbag = TRUE, which triggers this check at line 95 of the quantregForest code:

minoob <- min( apply(!is.na(valuesPredict),1,sum))
if(minoob<10) stop("need to increase number of trees for sufficiently many out-of-bag observations")

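So it requires every one of your observations to be out of bag (OOB) in at least 10 trees in order to keep the out-of-bag predictions. If your real data is very large, the ntree needed to keep the out-of-bag predictions will therefore be huge.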
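To get a feel for why ntree = 30 fails and ntree = 150 passes, here is a rough back-of-the-envelope sketch in base R. It assumes each tree draws sampsize rows with replacement from the n training rows (sampsize = n being the usual bootstrap), so one observation is out of bag in a given tree with probability (1 - 1/n)^sampsize. The helper names p_oob, expected_oob and p_too_few are made up for this illustration and are not part of quantregForest or caret.

```r
# Probability that one observation is left out of a single tree's sample,
# assuming `sampsize` rows are drawn with replacement from `n` rows.
p_oob <- function(n, sampsize = n) (1 - 1 / n)^sampsize

# Expected number of trees (out of `ntree`) in which one observation is OOB.
expected_oob <- function(n, ntree, sampsize = n) ntree * p_oob(n, sampsize)

# Probability that one observation is OOB in fewer than `min_oob` trees.
p_too_few <- function(n, ntree, sampsize = n, min_oob = 10) {
  pbinom(min_oob - 1, size = ntree, prob = p_oob(n, sampsize))
}

n_fold <- 45  # roughly the training size of one CV fold in the example

expected_oob(n_fold, ntree = 30)   # about 10.9, right at the threshold of 10
expected_oob(n_fold, ntree = 150)  # about 54.6, comfortably above it
p_too_few(n_fold, ntree = 30)      # chance that a single row falls short
p_too_few(n_fold, ntree = 150)
```

With 30 trees the expected OOB count per row sits right at the threshold, so it is very likely that at least one of the roughly 45 rows in a fold ends up below 10, and the check looks at the minimum over all rows.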

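If you are training with caret, keeping the OOB predictions and also setting savePredictions = TRUE seems redundant. Overall, the OOB predictions may not be that useful, because you will be predicting on the held-out test folds anyway.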

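Given the size of your data, another option is to adjust sampsize. In randomForest, only sampsize observations are sampled, with replacement, to fit each tree. If you set this to a smaller size, you can make sure there are enough OOB observations. For instance, using the example given, we can see: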

model <- caret::train(x = preds, 
                      y = response,
                      method ="qrf",
                      ntree = 30, sampsize=17,
                      metric = "RMSE",
                      tuneGrid = tunegrid,
                      trControl = finalcontrol,
                      importance = TRUE,
                      keep.inbag = TRUE)

model
Quantile Random Forest 

50 samples
57 predictors

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 44, 43, 44, 46, 45, 46, ... 
Resampling results across tuning parameters:

  mtry       RMSE    
   2.000000  42.53061
   7.549834  42.72116
  10.000000  43.11533
  19.000000  42.80340

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
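If you want to check the effect of sampsize without fitting any forest, a small simulation of the sampling step alone might look like the following. This is only a sketch: it mimics drawing sampsize rows with replacement per tree in base R and does not call randomForest or quantregForest, and min_oob_count is a hypothetical helper name.

```r
# Simulate the in-bag draws of `ntree` trees and return the smallest
# number of trees in which any single observation is out of bag.
min_oob_count <- function(n, ntree, sampsize = n) {
  oob_counts <- integer(n)
  for (tree in seq_len(ntree)) {
    # rows drawn (with replacement) for this tree
    inbag <- unique(sample.int(n, size = sampsize, replace = TRUE))
    # every row that was not drawn is out of bag for this tree
    oob_counts[-inbag] <- oob_counts[-inbag] + 1L
  }
  min(oob_counts)
}

set.seed(1)
n_fold <- 45  # roughly the training size of one CV fold in the example
ntree  <- 30

full  <- replicate(200, min_oob_count(n_fold, ntree))                 # sampsize = n
small <- replicate(200, min_oob_count(n_fold, ntree, sampsize = 17))  # sampsize = 17

mean(full >= 10)   # share of simulated forests that would pass the check
mean(small >= 10)
```

Under these assumptions, almost none of the full-bootstrap runs reach a minimum of 10 OOB trees with ntree = 30, while nearly all runs with sampsize = 17 do, which matches the behaviour seen above.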
