python - Different results when performing feature selection with XGBoost
Problem description
# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel
# load data
dataset = loadtxt('.\\DataSets\\pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:, 0:8]
Y = dataset[:, 8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier(eval_metric="error")
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# fit a model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier(eval_metric="error")
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    select_X_test = X_test[:, 0:select_X_test.shape[1]] * 0 + select_X_test
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy * 100.0))
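For reference, SelectFromModel with prefit=True keeps exactly the columns whose importance is at least the threshold, and the same fitted selector must be applied to both train and test so the columns line up. A minimal sketch of this mechanism, using a scikit-learn RandomForestClassifier on synthetic data as a stand-in for XGBClassifier:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic 8-feature data standing in for the Pima dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
model = RandomForestClassifier(random_state=7).fit(X, y)

# Keep features whose importance is >= the median importance.
thresh = np.median(model.feature_importances_)
selection = SelectFromModel(model, threshold=thresh, prefit=True)
X_sel = selection.transform(X)

# The kept columns are exactly those with importance >= thresh,
# in their original order.
mask = model.feature_importances_ >= thresh
print(X_sel.shape[1] == mask.sum())  # True
```

Because the selector is built from a prefit model, calling transform on X_train and X_test uses the same column mask for both, which is what keeps the train and test feature sets consistent inside the loop.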
The dataset is available from https://www.kaggle.com/kumargh/pimaindiansdiabetescsv.
Running the code above gives:
Accuracy: 74.02%
Thresh=0.088, n=8, Accuracy: 74.02%
Thresh=0.089, n=7, Accuracy: 71.65%
Thresh=0.098, n=6, Accuracy: 71.26%
Thresh=0.098, n=5, Accuracy: 74.41%
Thresh=0.100, n=4, Accuracy: 74.80%
Thresh=0.136, n=3, Accuracy: 71.26%
Thresh=0.152, n=2, Accuracy: 71.26%
Thresh=0.240, n=1, Accuracy: 67.32%
However, when the line select_X_test = X_test[:, 0:select_X_test.shape[1]]*0 + select_X_test is commented out (a line that appears to do nothing), the results become:
Accuracy: 74.02%
Thresh=0.088, n=8, Accuracy: 60.63%
Thresh=0.089, n=7, Accuracy: 61.02%
Thresh=0.098, n=6, Accuracy: 59.45%
Thresh=0.098, n=5, Accuracy: 57.87%
Thresh=0.100, n=4, Accuracy: 63.39%
Thresh=0.136, n=3, Accuracy: 56.30%
Thresh=0.152, n=2, Accuracy: 57.87%
Thresh=0.240, n=1, Accuracy: 67.32%
What causes the difference? Is this a bug? I believe the first set of results is the correct one. Whether or not a seed is set does not remove the discrepancy; the only change between the two runs is the line select_X_test = X_test[:, 0:select_X_test.shape[1]]*0 + select_X_test.
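As an aside, the suspicious line can be checked numerically: multiplying a slice of X_test by zero yields a zero matrix, so adding select_X_test to it leaves the values unchanged. A minimal sketch with synthetic arrays (the shapes here are illustrative, not the actual dataset):

```python
import numpy as np

# Stand-ins for X_test and the selected-feature subset.
rng = np.random.default_rng(0)
X_test = rng.random((5, 8))
select_X_test = X_test[:, [1, 4, 6]]  # pretend 3 features survived the threshold

k = select_X_test.shape[1]
combined = X_test[:, 0:k] * 0 + select_X_test  # the "meaningless" line

# Element-wise, the result is identical to select_X_test.
print(np.array_equal(combined, select_X_test))  # True
```

So the line is indeed a numerical no-op, which supports the idea that the differing accuracies come from somewhere else.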
Solution
Your XGBClassifier is not seeded, so it produces different results from run to run. To get reproducible results, pass a fixed seed instead, for example:
XGBClassifier(eval_metric="error", random_state=3)
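The principle can be demonstrated without XGBoost: any estimator that uses randomness internally gives identical results across runs once random_state is fixed. A sketch using scikit-learn's GradientBoostingClassifier on synthetic data as a stand-in for XGBClassifier (the same random_state mechanism applies):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic 8-feature data standing in for the Pima dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=7)

def fit_and_predict(seed):
    model = GradientBoostingClassifier(random_state=seed)
    model.fit(X_train, y_train)
    return model.predict(X_test)

# Two runs with the same seed produce identical predictions.
p1 = fit_and_predict(3)
p2 = fit_and_predict(3)
print((p1 == p2).all())  # True
```

In the original loop, seeding both XGBClassifier instances the same way removes run-to-run variation, so any remaining difference in accuracy can be attributed to the code change itself rather than to randomness.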