Why does Scikit-learn's RandomizedSearchCV raise a "positional indexers are out-of-bounds" error?

Problem description

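If this question would fit better on CrossValidated, please let me know. I figured I'd start here because it is mainly about a specific error that happens to occur in a machine learning algorithm, rather than a question about methodology.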

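I'm working on a machine learning project to predict the number of COVID deaths in each US county, using a dataset from Kaggle. To tune the hyperparameters of a random forest regressor, I'm using sklearn's RandomizedSearchCV class, but fitting it throws an IndexError: positional indexers are out-of-bounds, and the traceback only references pandas modules. This does not happen when I fit the random forest regressor normally, without RandomizedSearchCV, using a simpler split (no cross-validation).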

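At first I thought it might be related to the range of values I was passing in, but I reduced the values for each parameter individually and for all parameters together, and I still hit the same problem.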

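My current suspicion is the custom cross-validation class I use to split multiple time series (MultipleTimeSeriesSplit in the code), but it seems to work fine on the training data. It doesn't work on the labels alone, because the split depends on the fips_target column in the features. I also don't believe that is the issue here, since trying to split the labels results in a missing-column error, not a positional indexing error.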

What is causing this IndexError? How can I get this to work?

The following code is written for Kaggle, but if you're not on there, you should be able to download the necessary dataset here.

import numpy as np
import pandas as pd



import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



covid = pd.read_csv("../input/us-counties-covid-19-dataset/us-counties.csv").dropna()

    
# label encoding
from sklearn.preprocessing import LabelEncoder

cat_features = ['fips']
encoder = LabelEncoder()

# Apply the label encoder to each column
encoded = covid[cat_features].apply(encoder.fit_transform)
covid = covid.assign(fips = encoded)


# target encoding
import category_encoders as ce
from sklearn.model_selection import TimeSeriesSplit

def train_valid_test_incremental(df, train_frac=0.75):
    
    train = pd.DataFrame(columns=df.columns)
    test = pd.DataFrame(columns=df.columns)
    
    for location in pd.unique(df.fips):
        idx = df.fips == location
        d = df.loc[idx]
        train = train.append(d.iloc[:int(train_frac*len(d.index))])
        test = test.append(d.iloc[int(train_frac*len(d.index)):])

    train = train.infer_objects()
    test = test.infer_objects()
    return train, test

print("splitting (non cv)")
train, test = train_valid_test_incremental(covid)

# Create the encoder itself
target_enc = ce.TargetEncoder()

# Fit the encoder using the categorical features and target
target_enc.fit(train[cat_features], train['deaths'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
test = test.join(target_enc.transform(test[cat_features]).add_suffix('_target'))

# split data and labels for test and train, keep useful features
feat_cols = train.drop(columns = ['county', 'state', 'fips', 'date', 'cases', 'deaths']).columns

feats_train = train[feat_cols]
feats_test = test[feat_cols]

deaths_train = train.deaths
deaths_test = test.deaths


# use forward chaining cv for each county "fips"
class MultipleTimeSeriesSplit():
    def __init__(self, n_cvs=5):
        self.n_cvs = n_cvs
        self.n_splits = self.n_cvs + 1
    
    def get_n_splits(self, X, y, groups):
        return self.n_splits
    
    def split(self, X, y=None, groups=None): # yielding no test data, same training data
        fips_groups = {fips: list(X[X.fips_target == fips].index) for fips in pd.unique(X.fips_target)}

        start = 0
        for n in range(2, self.n_cvs + 2):
            train, test = [], []
            for fips in pd.unique(X.fips_target):
                indices = fips_groups[fips]
                k_fold_size = len(indices) /  self.n_splits

                mid = int(k_fold_size*(n-1))
                stop = int(k_fold_size*n)

                train += indices[:mid]
                test += indices[mid: stop]
                            
            yield train, test
    
# create model, tune hyperparameters, and fit to data
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, ParameterGrid, ParameterSampler
# Random search cv, need to use time series cv method

mtscv = MultipleTimeSeriesSplit()
rforest_deaths = RandomForestRegressor(random_state=0)
rf_param_dist = ParameterGrid({'bootstrap': [True, False],
    'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, None],
    'max_features': ['auto', 'sqrt'],
    'min_samples_leaf': [1, 2, 4],
    'min_samples_split': [2, 5, 10],
    'n_estimators': [10, 50, 100, 200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]})
rf_rscv = RandomizedSearchCV(estimator=rforest_deaths, param_distributions = rf_param_dist, random_state=0, cv=mtscv, return_train_score=True)
rf_rscv.fit(feats_train, deaths_train)

training_pred_rf = rf_rscv.predict(feats_train)
train_rf_loss = mean_squared_error(deaths_train, training_pred_rf)

predictions_rf = rf_rscv.predict(feats_test)
test_rf_loss = mean_squared_error(deaths_test, predictions_rf)

print(f"Random Forest\ntraining error: {np.sqrt(train_rf_loss)}\t\ttesting error: {np.sqrt(test_rf_loss)}")

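This minimal example leaves out the working fit of the random forest regressor without cross-validation, the other models I'm testing, and the demographic data I join to the COVID data.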


The existing answer does not resolve my error. I'm really not sure how to diagnose this or make progress on my own.

Tags: python, machine-learning, scikit-learn, index-error

Solution


Your function train_valid_test_incremental() is returning None. To check, run print(len(train)) below the function definition, right after the line train, test = train_valid_test_incremental(covid).

That is why you are getting this error. Also, for time series problems, use an LSTM/ARIMA model rather than RF.
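Separately from the answer above, one more thing that may be worth checking, offered as an unconfirmed guess rather than a diagnosis: scikit-learn treats the train/test arrays yielded by a cv splitter as positional row indices, whereas the split() method in the question collects pandas index labels via X[...].index. In the question's code the rows of train are appended from covid without resetting the index, so those labels can be larger than len(train), and positional lookup with such values raises exactly this kind of IndexError. The sketch below reproduces the effect on a toy frame; the frame, its index values, and the helper names label_indices and positional_indices are hypothetical and only illustrate the difference.

```python
import numpy as np
import pandas as pd

# Toy frame whose index labels are NOT 0..n-1, mimicking rows that were
# collected from a larger frame and kept their original labels.
X = pd.DataFrame(
    {"fips_target": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]},
    index=[10, 11, 12, 40, 41, 42],
)


def label_indices(X, group):
    # Same idea as the question's split(): returns index *labels*.
    return list(X[X.fips_target == group].index)


def positional_indices(X, group):
    # Row *positions* in the range 0..len(X)-1, usable with .iloc.
    mask = (X.fips_target == group).to_numpy()
    return np.flatnonzero(mask).tolist()


labels = label_indices(X, 2.0)
positions = positional_indices(X, 2.0)

# Labels used as positions point past the end of the 6-row frame.
try:
    X.iloc[labels]
    label_lookup_failed = False
except IndexError as err:
    label_lookup_failed = True
    print("iloc with labels failed:", err)

# Positions select the intended rows.
subset = X.iloc[positions]
print(subset.index.tolist())
```

If this turns out to be the cause, two options would be to call reset_index(drop=True) on the training features and labels before fitting, or to build the splits from row positions as in positional_indices. A quick check on the real data is to compare feats_train.index.max() with len(feats_train).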

