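python - Cannot predict on new data after training RFE and a model in a pipeline
Problem description
I'm new to Python and machine learning, so I am surely missing something.
I am training a RandomForest model with nested CV for hyperparameter tuning, and I use a pipeline to train RFECV. When I retrieve best_estimator_.n_features, it still shows the 17 original features, not the 3 that RFECV narrowed them down to.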
X
1182 rows × 17 columns
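from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Inner CV: hyperparameter search for the random forest
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
clf = RandomForestClassifier(random_state=42, n_jobs=-1, criterion='entropy', bootstrap=False)
space = {'n_estimators': [900, 1000, 1100],
         'max_depth': [25, 50, 100],
         'min_samples_split': [500, 750, 1000],
         'min_samples_leaf': [32, 64]
         }
search = GridSearchCV(clf, space, scoring='accuracy', n_jobs=1, cv=cv_inner, refit=True)
# Feature selection step, followed by the grid search as the final pipeline step
rfe = RFECV(estimator=RandomForestClassifier())
ppln = Pipeline(steps=[('rfe', rfe), ('grid', search)])
# Outer CV: score the whole pipeline
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(ppln, X, y.ravel(), scoring='accuracy', cv=cv_outer, n_jobs=-1)
# Fit on all of the data before predicting
ppln.fit(X, y.ravel())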
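A point that is easy to miss in the code above: cross_val_score fits clones of the estimator it is given, so the scores it returns say nothing about the state of ppln itself. Only the explicit ppln.fit call fits the objects held in the pipeline. The sketch below illustrates this on synthetic data. The names X_demo, y_demo, rfe_demo and ppln_demo are made up for the example, and the forest sizes, step and fold counts are shrunk so that it runs quickly; none of those values come from the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 17-column X in the question (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=17, n_informative=3, random_state=0
)

rfe_demo = RFECV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    step=4,
    cv=3,
)
ppln_demo = Pipeline(steps=[
    ("rfe", rfe_demo),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

cv_outer = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(ppln_demo, X_demo, y_demo, scoring="accuracy", cv=cv_outer)

# The scores come from clones, so the original selector has no fitted state yet.
fitted_after_cv = hasattr(rfe_demo, "support_")

# An explicit fit on the pipeline fits the step objects it holds.
ppln_demo.fit(X_demo, y_demo)
fitted_after_fit = hasattr(rfe_demo, "support_")

print(scores.mean(), fitted_after_cv, fitted_after_fit)
```

If the selector or the search object is also fitted on its own somewhere else in the session, its state can drift away from what the pipeline expects.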
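After fitting the pipeline, I tried to predict on new data (the fixtures), which has the original 17 features. The error message shown is: "ValueError: Number of features of the model must match the input. Model n_features is 17 and input n_features is 3".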
import pandas as pd

# Round-trip the Excel fixtures through CSV, then drop index-like and target columns
fixtureXLS = pd.read_excel('aaafixtures.xlsx')
fixtureXLS.to_csv('bbbfixtures.csv', encoding='utf-8')
fixt = pd.read_csv('bbbfixtures.csv')
fixt = fixt.loc[:, ~fixt.columns.str.contains('^Unnamed')]
if 'Result' in fixt.columns:
    fixt = fixt.drop(['Result'], axis=1)
fixt
287 rows × 17 columns
fixt['Predicted'] = ppln.predict(fixt)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-164-e54f4c6f6e05> in <module>
----> 1 temp = ppln.predict(fixt)
~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
117
118 # lambda, but not partial, allows help() to work with update_wrapper
--> 119 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120 # update the docstring of the returned function
121 update_wrapper(out, self.fn)
~\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
406 for _, name, transform in self._iter(with_final=False):
407 Xt = transform.transform(Xt)
--> 408 return self.steps[-1][-1].predict(Xt, **predict_params)
409
410 @if_delegate_has_method(delegate='_final_estimator')
~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
117
118 # lambda, but not partial, allows help() to work with update_wrapper
--> 119 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120 # update the docstring of the returned function
121 update_wrapper(out, self.fn)
~\anaconda3\lib\site-packages\sklearn\model_selection\_search.py in predict(self, X)
485 """
486 self._check_is_fitted('predict')
--> 487 return self.best_estimator_.predict(X)
488
489 @if_delegate_has_method(delegate=('best_estimator_', 'estimator'))
~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
627 The predicted classes.
628 """
--> 629 proba = self.predict_proba(X)
630
631 if self.n_outputs_ == 1:
~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
671 check_is_fitted(self)
672 # Check data
--> 673 X = self._validate_X_predict(X)
674
675 # Assign chunk of trees to jobs
~\anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in _validate_X_predict(self, X)
419 check_is_fitted(self)
420
--> 421 return self.estimators_[0]._validate_X_predict(X, check_input=True)
422
423 @property
~\anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
394 n_features = X.shape[1]
395 if self.n_features_ != n_features:
--> 396 raise ValueError("Number of features of the model must "
397 "match the input. Model n_features is %s and "
398 "input n_features is %s "
ValueError: Number of features of the model must match the input. Model n_features is 17 and input n_features is 3
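The traceback says the selector passed on 3 columns while the forest inside the search had been fitted on 17. One way to check where the two steps disagree is to read the feature counts off each fitted step. The sketch below does that on synthetic data with the same pipeline layout as the question. All names ending in _demo, the reduced grid and the small forests are assumptions chosen to keep the example quick. It reads the counts from support_ and feature_importances_ instead of a version-specific attribute.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 17-column training data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=17, n_informative=3, random_state=0
)

search_demo = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42),
    {"max_depth": [3, 5]},  # tiny illustrative grid, not the one from the question
    scoring="accuracy",
    cv=3,
    refit=True,
)
rfe_demo = RFECV(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    step=4,
    cv=3,
)
ppln_demo = Pipeline(steps=[("rfe", rfe_demo), ("grid", search_demo)])
ppln_demo.fit(X_demo, y_demo)

selector = ppln_demo.named_steps["rfe"]
forest = ppln_demo.named_steps["grid"].best_estimator_

n_in = selector.support_.shape[0]                # columns the selector was fitted on
n_kept = int(selector.support_.sum())            # columns the selector passes on
n_forest = forest.feature_importances_.shape[0]  # columns the forest was fitted on

print(n_in, n_kept, n_forest)

# Prediction takes the raw 17-column input; the pipeline applies the selector.
preds = ppln_demo.predict(X_demo[:5])
```

When the pipeline is fitted in one go, n_kept and n_forest should agree. A mismatch like the one in the traceback would be consistent with the search object having been refitted on the full 17 columns outside the pipeline, for example in a separate notebook cell. That is a guess about the session, not something the post confirms.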
I then transformed fixt down to the 3 selected features and passed the result to the pipeline's predict:
X_new = rfe.transform(fixt)
print(X_new.shape[1])
fixt['Predicted'] = ppln.predict(X_new)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-161-02280f45be5a> in <module>
----> 1 fixt['Predicted'] = ppln.predict(X_new)
~\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in <lambda>(*args, **kwargs)
117
118 # lambda, but not partial, allows help() to work with update_wrapper
--> 119 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
120 # update the docstring of the returned function
121 update_wrapper(out, self.fn)
~\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
405 Xt = X
406 for _, name, transform in self._iter(with_final=False):
--> 407 Xt = transform.transform(Xt)
408 return self.steps[-1][-1].predict(Xt, **predict_params)
409
~\anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in transform(self, X)
82 return np.empty(0).reshape((X.shape[0], 0))
83 if len(mask) != X.shape[1]:
---> 84 raise ValueError("X has a different shape than during fitting.")
85 return X[:, safe_mask(X, mask)]
86
ValueError: X has a different shape than during fitting.
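This second error follows from how Pipeline.predict works: as the traceback shows, it calls transform on every step before the last one and only then calls predict on the final step. Data that has already been reduced is therefore sent through the selector a second time. The sketch below illustrates this on synthetic data. To keep it short and deterministic it swaps RFECV for RFE with n_features_to_select=3 and uses a plain forest as the final step; both are substitutions made for the example, and the _demo names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 17-column data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=17, n_informative=3, random_state=0
)

ppln_demo = Pipeline(steps=[
    ("rfe", RFE(RandomForestClassifier(n_estimators=10, random_state=0),
                n_features_to_select=3)),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
ppln_demo.fit(X_demo, y_demo)

X_raw = X_demo[:20]                                         # 17 columns
X_reduced = ppln_demo.named_steps["rfe"].transform(X_raw)   # 3 columns

# Predicting through the pipeline equals selecting by hand and then
# predicting with the final step alone.
via_pipeline = ppln_demo.predict(X_raw)
via_steps = ppln_demo.named_steps["rf"].predict(X_reduced)
same = np.array_equal(via_pipeline, via_steps)

# Feeding already-reduced data to the pipeline runs the selector twice and fails.
try:
    ppln_demo.predict(X_reduced)
    double_selection_error = None
except ValueError as exc:
    double_selection_error = str(exc)

print(same, double_selection_error)
```

So the pipeline should be given the raw 17-column frame, and the reduced array belongs only with the final step on its own.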
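Could you please shed some light on this?
Solution
I don't know whether there is an automated way to do this, but I created a new pipeline in which the RandomForestClassifier takes its parameters from the best estimator of the previous search, then fitted it and predicted. I had to put RFE in front of it, though.
Instead of ppln.fit(X, y.ravel()), the final code is: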
params = search.best_estimator_.get_params()
rfc = RandomForestClassifier(**params)
ppln_new = Pipeline(steps=[('rfe',rfe),('pred',rfc)])
ppln_new.fit(X, y.ravel())
fixt['Predicted'] = ppln_new.predict(fixt)
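A possible alternative to copying the parameters by hand, offered here as a sketch and not as what the answer did, is to wrap the whole pipeline in GridSearchCV. The forest's parameters are then addressed with the step-name prefix (pred__ below), and with refit=True the search's best_estimator_ is a fitted pipeline that accepts the raw 17 columns. The data, the _demo names, the one-parameter grid and the small forests are all assumptions made so that the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the 17-column training data (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=17, n_informative=3, random_state=0
)

inner_pipeline = Pipeline(steps=[
    ("rfe", RFECV(estimator=RandomForestClassifier(n_estimators=10, random_state=0),
                  step=4, cv=3)),
    ("pred", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Parameters of a pipeline step are written as <step name>__<parameter>.
space_demo = {"pred__max_depth": [3, 5]}

search_demo = GridSearchCV(
    inner_pipeline,
    space_demo,
    scoring="accuracy",
    cv=KFold(n_splits=3, shuffle=True, random_state=1),
    refit=True,
)
search_demo.fit(X_demo, y_demo)

# The refitted best estimator is a complete pipeline: selector plus forest.
best_pipeline = search_demo.best_estimator_
preds = search_demo.predict(X_demo[:10])  # raw 17-column input

print(search_demo.best_params_, len(preds))
```

The trade-off is cost: feature selection is rerun for every parameter candidate and fold, which can be slow with grids and forests as large as the ones in the question.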