pandas - How do I combine GridSearchCV and SelectFromModel to reduce the number of features?
Question
I am trying to run a QSAR analysis with sklearn and pandas. Here is my code:
import pandas as pd
import numpy as np
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
FC_data = pd.read_excel('C:\\Users\\Dre\\Desktop\\My Papers\\Furocoumarins_paper_2018\\Furocoumarins_NEW1.xlsx', index_col=0)
FC_data.head()
# Create correlation matrix
corr_matrix = FC_data.corr().abs()
# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features
FC_data1 = FC_data.drop(columns=to_drop)
y = FC_data1.LogFiT
X = FC_data1.drop(['LogFiT', 'LogS'], axis=1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
randomforest = RandomForestRegressor(n_jobs=-1)
selector = SelectFromModel(randomforest)
features_important = selector.fit_transform(X_train, y_train)
model = randomforest.fit(features_important, y_train)
from sklearn.model_selection import GridSearchCV
clf_rf = RandomForestRegressor()
parameters = {"n_estimators":[1, 2, 3, 4, 5, 7, 10, 15, 20, 25, 30, 50], "max_depth":[1, 2, 3, 4, 5, 10, 15]}
grid_search_cv_clf = GridSearchCV(model, parameters, cv=5)
grid_search_cv_clf.fit(features_important, y_train)
from sklearn.metrics import r2_score
y_pred = grid_search_cv_clf.predict(features_important)
r2_score(y_train, y_pred)
grid_search_cv_clf.best_params_
best_clf = grid_search_cv_clf.best_estimator_
best_clf.score(X_test, y_test)
Error: ValueError: Number of features of the model must match the input. Model n_features is 22 and input n_features is 114
feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features':list(X_train),
'feature_importances':feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(20)
This gives an error: ValueError: arrays must all be same length
I understand the problem is that features_important (derived from X_train) and X_test have different numbers of features, but I don't know how to fix it. Please help!
Solution
You can filter the columns of X_test with the selector's mask:
X_test_filtered = X_test.iloc[:,selector.get_support()]
best_clf.score(X_test_filtered, y_test)
As for the second snippet, I believe this is what you want, but correct me if I'm wrong:
feature_importances = best_clf.feature_importances_
feature_importances_df = pd.DataFrame({'features': X_test_filtered.columns.values,
'feature_importances':feature_importances})
importances = feature_importances_df.sort_values('feature_importances', ascending=False)
importances.head(20)
Edit - I revised the approach when I realized the selector returns an array rather than a Series. Getting the mask with get_support and using it to index the columns should work.
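A cleaner way to answer the title question is to put SelectFromModel and the final regressor in a single Pipeline and grid-search over that, so the fitted mask is applied to both train and test data automatically and the n_features mismatch cannot occur. The sketch below uses synthetic data from make_regression as a stand-in for the question's spreadsheet; the step names "select" and "rf" are arbitrary labels chosen here:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the QSAR dataset
X, y = make_regression(n_samples=200, n_features=30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

pipe = Pipeline([
    ("select", SelectFromModel(RandomForestRegressor(n_estimators=50,
                                                     random_state=42))),
    ("rf", RandomForestRegressor(random_state=42)),
])

# Parameters for a pipeline step are prefixed with "<step name>__"
parameters = {"rf__n_estimators": [10, 30], "rf__max_depth": [3, 5]}
grid_search_cv_clf = GridSearchCV(pipe, parameters, cv=3)
grid_search_cv_clf.fit(X_train, y_train)

# score() re-applies the fitted selector to X_test internally,
# so no manual column filtering is needed
print(grid_search_cv_clf.best_params_)
print(grid_search_cv_clf.score(X_test, y_test))
```

With this layout the selection threshold itself (e.g. `select__threshold`) could also be added to the grid, so feature selection and hyperparameters are tuned together under the same cross-validation.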