Can I specify particular rows for sklearn's train_test_split? I need to know which rows are the test data

Problem description

I am working with the well-known turbofan jet engine degradation dataset to perform RUL (remaining useful life) prediction, and I am comparing different types of regression; everything works fine.

I have been happily using sklearn's train_test_split with test_size set to 0.3, which is what I want. However, I need to know which rows were used for the training split and which for the test split, because I need them for something else. Does that make sense? Are they fixed, or are they shuffled around and cross-validated in some way I am not understanding?

I think my doubt is just a data-wrangling question, but I also want to know whether it somehow interferes with the model.
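To make the question concrete, here is a minimal, self-contained sketch with toy data standing in for my frame (the column names are placeholders): as far as I can tell, a fixed random_state makes the split repeatable, and when a pandas DataFrame is passed in, the returned pieces keep the original row labels, which seems to be what I need, but I would like to be sure.

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# toy frame standing in for my data; "s1"/"s2"/"RUL" are placeholder names
df = pd.DataFrame(np.random.rand(10, 3), columns=["s1", "s2", "RUL"])

X_tr, X_te, y_tr, y_te = train_test_split(df.drop(columns=["RUL"]), df["RUL"],
                                           test_size=0.3, random_state=42)

# pandas objects keep their original index through the split,
# so these labels identify exactly which rows ended up in the test set
print(X_te.index.tolist())

# repeating the call with the same random_state gives the same rows again
X_tr2, X_te2, _, _ = train_test_split(df.drop(columns=["RUL"]), df["RUL"],
                                      test_size=0.3, random_state=42)
assert X_te.index.equals(X_te2.index)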

The shape of my dataset is (20631, 17).

Some relevant code:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

Making the split

X_train, X_test, y_train, y_test = train_test_split(
    train_df.drop(columns=["RUL", "unit"]),   # features: everything except the target and the unit id
    train_df["RUL"],                          # target: remaining useful life
    test_size=0.3,
    random_state=42)
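If it helps, the row labels can also be passed through the same call so the membership is recorded explicitly (idx_train / idx_test are just names I am introducing here):

# passing train_df.index as an extra array makes train_test_split return
# the matching row labels for each part of the split
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    train_df.drop(columns=["RUL", "unit"]),
    train_df["RUL"],
    train_df.index,
    test_size=0.3,
    random_state=42)

test_rows = train_df.loc[idx_test]   # the exact rows used as the test set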

Linear regression

from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X_train, y_train)

Decision tree regression

from sklearn.tree import DecisionTreeRegressor
DT = DecisionTreeRegressor(random_state = 42)
DT_random_grid = {'min_samples_split': range(2, 10),
               'min_samples_leaf': range(1, 5),
               'max_features': ["auto", "sqrt", "log2"]}
DT_gs  = RandomizedSearchCV(estimator = DT, n_jobs=-1, scoring = "neg_mean_squared_error",
                        param_distributions=DT_random_grid,n_iter=80,cv=5,iid=True,return_train_score =True)
DT_gs.fit(X_train,y_train)
DT = DT_gs.best_estimator_

Random forest regression

from sklearn.ensemble import RandomForestRegressor
RF = RandomForestRegressor(criterion="squared_error", random_state=42, verbose=1)
RF_random_grid = {'n_estimators': range(10, 300),
               'max_features': [None, 'sqrt', 'log2'],
               'min_samples_split': range(2, 10),
               'min_samples_leaf': range(1, 5)}
RF_gs = RandomizedSearchCV(estimator=RF, n_jobs=-1, scoring="neg_mean_squared_error",
                        param_distributions=RF_random_grid, n_iter=80, cv=5,
                        return_train_score=True, verbose=1)
RF_gs.fit(X_train,y_train)
RF = RF_gs.best_estimator_

Gradient boosting regression

from sklearn.ensemble import GradientBoostingRegressor
GB =  GradientBoostingRegressor(random_state = 42)
GB_random_grid = {'n_estimators': range(10, 300),
               'learning_rate': [0.01, 0.05, 0.1, 0.2],
               'min_samples_split': range(2, 10),
               'min_samples_leaf': range(1, 5),
                 'max_depth': range(2,8)}
GB_gs = RandomizedSearchCV(estimator=GB, n_jobs=-1, scoring="neg_mean_squared_error",
                        param_distributions=GB_random_grid, n_iter=82, cv=5,
                        return_train_score=True, verbose=1)
GB_gs.fit(X_train,y_train)
GB = GB_gs.best_estimator_
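For completeness, a sketch of how the tuned estimators could then be scored on the held-out rows with the metrics imported above (the cv=5 inside RandomizedSearchCV only re-splits X_train internally, so X_test is never seen during the search):

# score every fitted model on the same held-out test rows
for name, model in [("LM", LM), ("DT", DT), ("RF", RF), ("GB", GB)]:
    pred = model.predict(X_test)
    print(name,
          "MAE:", mean_absolute_error(y_test, pred),
          "RMSE:", mean_squared_error(y_test, pred) ** 0.5,
          "R2:", r2_score(y_test, pred))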

Thanks!

Tags: python, machine-learning, scikit-learn

Solution

