Very strange behaviour of xgboost in R

Problem description


I am learning to use the XGBoost package in R and I have run into some very strange behaviour that I don't know how to explain. Perhaps someone can give me some pointers. I have simplified the R code as much as possible:

rm(list = ls())
library(xgboost)
setwd("/home/my_username/Documents/R_files")

my_data <- read.csv("my_data.csv")
my_data$outcome_01 = ifelse(my_data$outcome_continuous > 0.0, 1, 0)

reg_features = c("feature_1", "feature_2")
class_features = c("feature_1", "feature_3")

set.seed(93571)
train_data = my_data[seq(1, nrow(my_data), 2), ]

mm_reg_train = model.matrix(~ . + 0, data = train_data[, reg_features])
train_DM_reg = xgb.DMatrix(data = mm_reg_train, label = train_data$outcome_continuous)

var_nrounds = 190
xgb_reg_model = xgb.train(data = train_DM_reg, booster = "gbtree", objective = "reg:squarederror",
                          nrounds = var_nrounds, eta = 0.07,
                          max_depth = 5, min_child_weight = 0.8, subsample = 0.6, colsample_bytree = 1.0,
                          verbose = F)

mm_class_train = model.matrix(~ . + 0, data = train_data[, class_features])
train_DM_class = xgb.DMatrix(data = mm_class_train, label = train_data$outcome_01)

xgb_class_model = xgb.train(data = train_DM_class, booster = "gbtree", objective = "binary:logistic",
                            eval_metric = 'auc', nrounds = 70, eta = 0.1,
                            max_depth = 3, min_child_weight = 0.5, subsample = 0.75, colsample_bytree = 0.5,
                            verbose = F)

probabilities = predict(xgb_class_model, newdata = train_DM_class, type = "response")
print(paste0("simple check: ", sum(probabilities)), quote = F)

The problem: the result of sum(probabilities) depends on var_nrounds!

How can that be? After all, var_nrounds only enters xgb_reg_model, while the probabilities are computed with xgb_class_model, which (should) know nothing about the value of var_nrounds. The only thing I change in this code is the value of var_nrounds, yet when I re-run it the sum of the probabilities changes. It also changes deterministically: with var_nrounds = 190 I always get (with my data) 5324.3, and with var_nrounds = 285 I get 5322.8. However, if I remove the line set.seed(93571), the result changes non-deterministically every time I re-run the code.
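A quick diagnostic that this determinism suggests (not part of the original post, just a sketch reusing the objects defined above; the names xgb_class_model_check and probabilities_check are introduced here for illustration): re-seed R's RNG immediately before training the classification model. If the two trainings interact only through R's global random-number stream, the re-seeded sum should no longer depend on var_nrounds.

# Sketch of a diagnostic (assumes train_DM_class from the code above already exists):
# fix the RNG state right before the second training, so it no longer depends on
# whatever the first training did to the random-number stream.
set.seed(93571)
xgb_class_model_check = xgb.train(data = train_DM_class, booster = "gbtree",
                                  objective = "binary:logistic", eval_metric = 'auc',
                                  nrounds = 70, eta = 0.1, max_depth = 3,
                                  min_child_weight = 0.5, subsample = 0.75,
                                  colsample_bytree = 0.5, verbose = F)
probabilities_check = predict(xgb_class_model_check, newdata = train_DM_class)
print(paste0("re-seeded check: ", sum(probabilities_check)), quote = F)
# If this value is identical for var_nrounds = 190 and var_nrounds = 285, the
# dependence runs through the shared RNG state rather than through the models themselves.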

Does XGBoost have some built-in random behaviour that changes with the number of rounds a previously trained model was run for, and that can also be controlled by setting a seed somewhere in the code before XGBoost is trained? Any ideas?
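One way to probe that question directly (again a sketch, not from the original post; seed_before and tmp_model are names introduced here, and train_DM_reg is reused from the code above): compare R's global RNG state before and after one of the xgb.train calls. If training with subsample < 1 draws from R's random-number stream, .Random.seed will have changed, which would explain why a model trained afterwards depends on how many rounds were run before it.

# Sketch: does xgb.train advance R's global RNG state?
set.seed(93571)
seed_before = .Random.seed
tmp_model = xgb.train(data = train_DM_reg, booster = "gbtree", objective = "reg:squarederror",
                      nrounds = 190, eta = 0.07, max_depth = 5, min_child_weight = 0.8,
                      subsample = 0.6, colsample_bytree = 1.0, verbose = F)
identical(seed_before, .Random.seed)   # FALSE would mean the training consumed the RNG state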

Tags: r, random, xgboost, random-seed

Solution

