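r - Train, validation, test split model in CARET in R
Problem description
I would like to ask a question. I am using this code to run an XGBoost model from the caret package. However, I want to use a time-based validation split: 60% training, 20% validation, 20% testing. I have already split the data, but I do not know how to handle the validation data if I am not using cross-validation.
Thanks,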
xgb_trainControl = trainControl(
  method = "cv",
  number = 5,
  returnData = FALSE
)
xgb_grid <- expand.grid(nrounds = 1000,
                        eta = 0.01,
                        max_depth = 8,
                        gamma = 1,
                        colsample_bytree = 1,
                        min_child_weight = 1,
                        subsample = 1
)
set.seed(123)
xgb1 = train(sale ~ ., data = trans_train,
             trControl = xgb_trainControl,
             tuneGrid = xgb_grid,
             method = "xgbTree"
)
xgb1
pred = predict(xgb1, trans_test)
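Solution
The validation partition should not be used when creating the model. It should be "set aside" until the model has been trained and tuned using the "training" and "tuning" partitions; you can then apply the model to predict the outcomes of the validation dataset and summarise how accurate the predictions are.
For example, in my own work I create three partitions, training (75%), tuning (10%) and testing/validation (15%), using: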
# Define the partition (e.g. 75% of the data for training)
trainIndex <- createDataPartition(data$response, p = .75,
                                  list = FALSE,
                                  times = 1)
# Split the dataset using the defined partition
train_data <- data[trainIndex, , drop = FALSE]
tune_plus_val_data <- data[-trainIndex, , drop = FALSE]
# Define a new partition to split the remaining 25%
tune_plus_val_index <- createDataPartition(tune_plus_val_data$response,
                                           p = .6,
                                           list = FALSE,
                                           times = 1)
# Split the remaining ~25% of the data: 40% (tune) and 60% (val)
tune_data <- tune_plus_val_data[-tune_plus_val_index, , drop = FALSE]
val_data <- tune_plus_val_data[tune_plus_val_index, , drop = FALSE]
# Outcome of this section is that the data (100%) is split into:
# training (~75%)
# tuning (~10%)
# validation (~15%)
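Note that createDataPartition() draws rows at random (stratified on the response), so the partitions above are not time-based. For the chronological 60/20/20 split asked about in the question, a minimal base-R sketch could look like the following. The simulated data, the `date` column and the name `trans_val` are assumptions made for illustration; `trans_train` and `trans_test` simply mirror the names used in the question.

```r
# Hypothetical daily data; `date`, `x` and `sale` are placeholder column names
set.seed(123)
n <- 100
dat <- data.frame(
  date = as.Date("2020-01-01") + sample(0:(n - 1)),  # deliberately shuffled
  x    = rnorm(n)
)
dat$sale <- 2 * dat$x + rnorm(n)

# Sort chronologically so that row position reflects time
dat <- dat[order(dat$date), , drop = FALSE]

# Cut points at 60% and 80% of the rows
train_end <- floor(0.6 * n)
val_end   <- floor(0.8 * n)

trans_train <- dat[seq_len(train_end), , drop = FALSE]      # oldest 60%
trans_val   <- dat[(train_end + 1):val_end, , drop = FALSE] # next 20%
trans_test  <- dat[(val_end + 1):n, , drop = FALSE]         # newest 20%
```

Because the rows are sorted before slicing, every validation row is later than every training row, and every test row is later than every validation row, so no future information leaks into training.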
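These data partitions are converted to xgb.DMatrix matrices ("dtrain", "dtune", "dval"). I then use the "training" partition to train the model, and the "tuning" partition to tune the hyperparameters (e.g. random grid search) and to evaluate model training (e.g. cross-validation). This is roughly equivalent to the code in your question.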
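The conversion to xgb.DMatrix is not shown in the answer. A minimal sketch of how it might be done, assuming all predictors are numeric and the label is a 0/1 column called `response` (the helper `to_dmatrix()` and the simulated partitions are hypothetical, introduced only for this example):

```r
library(xgboost)

set.seed(123)
# Hypothetical stand-ins for train_data / tune_data / val_data
make_part <- function(n) {
  data.frame(x1 = rnorm(n), x2 = rnorm(n), response = rbinom(n, 1, 0.5))
}
train_data <- make_part(75)
tune_data  <- make_part(10)
val_data   <- make_part(15)

# xgb.DMatrix takes a numeric matrix of predictors plus a numeric label vector
to_dmatrix <- function(df, label_col = "response") {
  x <- as.matrix(df[, setdiff(names(df), label_col), drop = FALSE])
  xgb.DMatrix(data = x, label = as.numeric(df[[label_col]]))
}

dtrain <- to_dmatrix(train_data)
dtune  <- to_dmatrix(tune_data)
dval   <- to_dmatrix(val_data)
```

If the data contain factor columns, they would need to be encoded numerically first (for example with model.matrix()) before as.matrix() is applied.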
# lrn and mytune are not defined in this answer; they appear to come from an
# earlier tuning step with the mlr package (setHyperPars() is an mlr function)
lrn_tune <- setHyperPars(lrn, par.vals = mytune$x)
params2 <- list(booster = "gbtree",
                objective = lrn_tune$par.vals$objective,
                eta = lrn_tune$par.vals$eta, gamma = 0,
                max_depth = lrn_tune$par.vals$max_depth,
                min_child_weight = lrn_tune$par.vals$min_child_weight,
                subsample = 0.8,
                colsample_bytree = lrn_tune$par.vals$colsample_bytree)
xgb2 <- xgb.train(params = params2,
                  data = dtrain, nrounds = 50,
                  watchlist = list(val = dtune, train = dtrain),
                  print_every_n = 10, early_stopping_rounds = 50,
                  maximize = FALSE, eval_metric = "error")
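If you would rather stay inside caret's train(), as in the question, one way to replace 5-fold cross-validation with a single fixed validation set is to pass the row indices through the `index` and `indexOut` arguments of trainControl(). The sketch below is only an illustration under assumptions: the data are simulated, the rows are assumed to be already sorted by time, and "rpart" with a small `cp` grid is used as a lightweight stand-in for "xgbTree" and `xgb_grid`.

```r
library(caret)

set.seed(123)
n <- 200
# Hypothetical time-ordered data; `sale` mirrors the response name in the question
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$sale <- 3 * dat$x1 - 2 * dat$x2 + rnorm(n)

# Chronological 60/20/20 split by row position (integer vectors)
train_idx <- 1:120
val_idx   <- 121:160
test_idx  <- 161:200

# train() only sees the training + validation rows; the test rows stay untouched.
# The indices above refer to rows of dev_data, which keeps the original order.
dev_data <- dat[c(train_idx, val_idx), ]

# A single "resample": fit on the training rows, score on the validation rows
ctrl <- trainControl(
  method   = "cv",
  index    = list(holdout = train_idx),
  indexOut = list(holdout = val_idx)
)

fit <- train(
  sale ~ ., data = dev_data,
  method    = "rpart",                                  # stand-in for "xgbTree"
  tuneGrid  = data.frame(cp = c(0.001, 0.01, 0.1)),     # stand-in for xgb_grid
  trControl = ctrl
)

# Final, one-off assessment on the untouched test rows
test_pred <- predict(fit, newdata = dat[test_idx, ])
test_rmse <- sqrt(mean((test_pred - dat$sale[test_idx])^2))
```

Each candidate in the tuning grid is scored once, on the validation rows only. After picking the best candidate, train() refits the final model on all rows it was given (training plus validation), so the test rows remain the only data the model has never seen.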
After training the model, I apply it to the validation data using predict():
library(ggplot2)  # ggplot()
library(caret)    # confusionMatrix()

xgbpred2_keep <- predict(xgb2, dval)
xg2_val <- data.frame("Prediction" = xgbpred2_keep,
                      "Patient" = rownames(val_data),
                      "Response" = val_data$response)
# Reorder Patients according to Response
xg2_val$Patient <- factor(xg2_val$Patient,
                          levels = xg2_val$Patient[order(xg2_val$Response)])
ggplot(xg2_val, aes(x = Patient, y = Prediction,
                    fill = Response)) +
  geom_bar(stat = "identity") +
  theme_bw(base_size = 16) +
  labs(title = paste("Patient predictions (xgb2) for the validation dataset (n = ",
                     nrow(val_data), ")", sep = ""),
       subtitle = "Above 0.5 = Non-Responder, Below 0.5 = Responder",
       caption = paste("JM", Sys.Date(), sep = " "),
       x = "") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5,
                                   hjust = 1, size = 8)) +
  # Distance from red line = confidence of prediction
  geom_hline(yintercept = 0.5, colour = "red")
# Convert predictions to binary outcome (responder / non-responder)
xgbpred2_binary <- ifelse(predict(xgb2, dval) > 0.5, 1, 0)
# Results matrix (i.e. true positives/negatives & false positives/negatives)
# labels_tv is not defined in this answer; it is assumed to hold the true
# 0/1 labels of the validation set
confusionMatrix(as.factor(xgbpred2_binary), as.factor(labels_tv))
# Summary of results
Summary_of_results <- data.frame(Patient_ID = rownames(val_data),
                                 label = labels_tv,
                                 pred = xgbpred2_binary)
Summary_of_results$eval <- ifelse(
  Summary_of_results$label != Summary_of_results$pred,
  "wrong",
  "correct")
Summary_of_results$conf <- round(predict(xgb2, dval), 2)
Summary_of_results$CDS <- val_data$`variants`
Summary_of_results
This gives you a summary of how well the model "works" on your validation data.