XGBoost incremental training with fewer classes present

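Problem description

I am running into a problem when training an XGBoost model incrementally. If the incremental dataset contains samples from every class, incremental training works fine. However, if the dataset does not contain samples from every class, training fails.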

MRE:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split as ttsplit
import xgboost as xgb

X = load_iris()['data']
y = load_iris()['target']

# split data into training and testing sets
# then split training set in half for base model and incremental model
X_train, X_test, y_train, y_test = ttsplit(X, y, test_size=0.1, random_state=0)
X_train_1, X_train_2, y_train_1, y_train_2 = ttsplit(
    X_train, y_train, test_size=0.5, random_state=0)

clf = xgb.XGBClassifier(use_label_encoder=False)
clf.fit(X_train_1, y_train_1)

# Artificially remove one group from the labels to showcase behavior
y_train_2[y_train_2 == 0] = 1
clf2 = xgb.XGBClassifier(use_label_encoder=False)
clf2.fit(X_train_2, y_train_2, xgb_model=clf)

This results in:

ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class - 1].

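This would be the expected behavior when training from scratch, but requiring every incremental dataset to contain samples from every class is inconvenient. What can I do to overcome this limitation?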

Tags: python, xgboost

Solution

