python - 使用 GaussianNB 分类器进行交叉验证的 Python 朴素贝叶斯
问题描述
我想对我的数据应用具有 10 倍分层交叉验证的朴素贝叶斯,然后我想看看模型在我最初预留的测试数据上的表现如何。但是,我得到的结果(即预测结果和概率值y_pred_nb2
和y_score_nb2
)与我在没有任何交叉验证的情况下运行代码时相同。问题:我该如何纠正这个问题?
代码如下,其中X_train
包含整个数据集的 75% 和X_test
25%。
from sklearn.model_selection import StratifiedKFold
params = {}
#gridsearch searches for the best hyperparameters and keeps the classifier with the highest recall score
skf = StratifiedKFold(n_splits=10)
nb2 = GridSearchCV(GaussianNB(), cv=skf, param_grid=params)
%time nb2.fit(X_train, y_train)
# predict values on the test set
y_pred_nb2 = nb2.predict(X_test)
print(y_pred_nb2)
# predicted probabilities on the test set
y_scores_nb2 = nb2.predict_proba(X_test)[:, 1]
print(y_scores_nb2)
解决方案
首先,GaussianNB
仅接受priors
作为参数,因此除非您提前为模型设置一些先验,否则您将无法进行网格搜索。
此外,您param_grid
被设置为一个空字典,以确保您只适合一个估计器GridSearchCV
。这与在不使用网格搜索的情况下拟合估计器相同(例如,我使用MultinomialNB
它来显示超参数的使用):
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
skf = StratifiedKFold(n_splits=10)
params = {}
nb = MultinomialNB()
gs = GridSearchCV(nb, cv=skf, param_grid=params, return_train_score=True)
data = load_iris()
x_train, x_test, y_train, y_test = train_test_split(data.data, data.target)
gs.fit(x_train, y_train)
gs.cv_results_
{'mean_fit_time': array([0.]),
'mean_score_time': array([0.]),
'mean_test_score': array([0.85714286]),
'mean_train_score': array([0.85992157]),
'params': [{}],
'rank_test_score': array([1]),
'split0_test_score': array([0.91666667]),
'split0_train_score': array([0.84]),
'split1_test_score': array([0.75]),
'split1_train_score': array([0.86]),
'split2_test_score': array([0.83333333]),
'split2_train_score': array([0.84]),
'split3_test_score': array([0.91666667]),
'split3_train_score': array([0.83]),
'split4_test_score': array([0.83333333]),
'split4_train_score': array([0.85]),
'split5_test_score': array([0.91666667]),
'split5_train_score': array([0.84]),
'split6_test_score': array([0.9]),
'split6_train_score': array([0.88235294]),
'split7_test_score': array([0.8]),
'split7_train_score': array([0.88235294]),
'split8_test_score': array([0.8]),
'split8_train_score': array([0.89215686]),
'split9_test_score': array([0.9]),
'split9_train_score': array([0.88235294]),
'std_fit_time': array([0.]),
'std_score_time': array([0.]),
'std_test_score': array([0.05832118]),
'std_train_score': array([0.02175538])}
nb.fit(x_train, y_train)
nb.score(x_test, y_test)
0.8157894736842105
gs.score(x_test, y_test)
0.8157894736842105
gs.param_grid = {'alpha': [0.1, 2]}
gs.fit(x_train, y_train)
gs.score(x_test, y_test)
0.8421052631578947
gs.cv_results_
{'mean_fit_time': array([0.00090394, 0.00049713]),
'mean_score_time': array([0.00029924, 0.0003005 ]),
'mean_test_score': array([0.86607143, 0.85714286]),
'mean_train_score': array([0.86092157, 0.85494118]),
'param_alpha': masked_array(data=[0.1, 2],
mask=[False, False],
fill_value='?',
dtype=object),
'params': [{'alpha': 0.1}, {'alpha': 2}],
'rank_test_score': array([1, 2]),
'split0_test_score': array([0.91666667, 0.91666667]),
'split0_train_score': array([0.84, 0.83]),
'split1_test_score': array([0.75, 0.75]),
'split1_train_score': array([0.86, 0.86]),
'split2_test_score': array([0.83333333, 0.83333333]),
'split2_train_score': array([0.85, 0.84]),
'split3_test_score': array([0.91666667, 0.91666667]),
'split3_train_score': array([0.83, 0.81]),
'split4_test_score': array([0.83333333, 0.83333333]),
'split4_train_score': array([0.85, 0.84]),
'split5_test_score': array([0.91666667, 0.91666667]),
'split5_train_score': array([0.84, 0.84]),
'split6_test_score': array([0.9, 0.9]),
'split6_train_score': array([0.88235294, 0.88235294]),
'split7_test_score': array([0.9, 0.8]),
'split7_train_score': array([0.88235294, 0.88235294]),
'split8_test_score': array([0.8, 0.8]),
'split8_train_score': array([0.89215686, 0.89215686]),
'split9_test_score': array([0.9, 0.9]),
'split9_train_score': array([0.88235294, 0.87254902]),
'std_fit_time': array([0.00030147, 0.00049713]),
'std_score_time': array([0.00045711, 0.00045921]),
'std_test_score': array([0.05651628, 0.05832118]),
'std_train_score': array([0.02103457, 0.02556351])}
推荐阅读
- entitlements - 将数据推送到 Open Policy Agent 时如何处理容器重启
- json - Angular,如何订阅嵌套的 JSON 对象?
- odata - 如何将 $filter 添加到 XML 中的 OData 聚合绑定
- javascript - 如何重用一个组件确实在多个组件之间挂载了等效的钩子
- xml - XSL 将 XML 转换为更小的 XML
- node.js - 使用 NodeJS/Puppeteer 下载多个图像
- python - 为什么在尝试计算具有较大初始值的值时总是出现溢出错误?
- python - 使用 Cmake 配置暗网时出错
- linux - “中断处理程序的中断服务程序”是什么意思?
- javascript - 在 Vue App 中使用动态目录的 webpack require.context() 的解决方法