GridSearchCV on a working pipeline returns ValueError

Problem description

I am using GridSearchCV to find the best parameters for my pipeline.

My pipeline seems to work well, since I can run:

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

and I get a decent result.

But GridSearchCV obviously doesn't like something, and I can't figure out what.

My pipeline:

feats = FeatureUnion([('age', age),
                      ('education_num', education_num),
                      ('is_education_favo', is_education_favo),
                      ('is_marital_status_favo', is_marital_status_favo),
                      ('hours_per_week', hours_per_week),
                      ('capital_diff', capital_diff),
                      ('sex', sex),
                      ('race', race),
                      ('native_country', native_country)
                     ])

pipeline = Pipeline([
        ('adhocFC',AdHocFeaturesCreation()),
        ('imputers', KnnImputer(target = 'native-country', n_neighbors = 5)),
        ('features',feats),('clf',LogisticRegression())])

My grid search:

hyperparameters = {'imputers__n_neighbors' : [5,21,41], 'clf__C' : [1.0, 2.0]}

GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False

GSCV.fit(X_train, y_train)

I get 11 warnings like this one:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

And here is the error message:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:11: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/__main__.py:14: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in ()
      3 GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False
      4
----> 5 GSCV.fit(X_train, y_train)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups)
    943             train/test set.
    944         """
--> 945         return self._fit(X, y, groups, ParameterGrid(self.param_grid))
    946
    947

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in _fit(self, X, y, groups, parameter_iterable)
    562                                   return_times=True, return_parameters=True,
    563                                   error_score=self.error_score)
--> 564           for parameters in parameter_iterable
    565           for train, test in cv_iter)
    566

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self, iterable)
    756             # was dispatched. In particular this covers the edge
    757             # case of Parallel used with an exhausted iterator.
--> 758             while self.dispatch_one_batch(iterator):
    759                 self._iterating = True
    760             else:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator)
    606                 return False
    607             else:
--> 608                 self._dispatch(tasks)
    609                 return True
    610

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch)
    569         dispatch_timestamp = time.time()
    570         cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self)
--> 571         job = self._backend.apply_async(batch, callback=cb)
    572         self._jobs.append(job)
    573

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback)
    107     def apply_async(self, func, callback=None):
    108         """Schedule a func to be run"""
--> 109         result = ImmediateResult(func)
    110         if callback:
    111             callback(result)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in __init__(self, batch)
    324         # Don't delay the application, to avoid keeping the input
    325         # arguments in memory
--> 326         self.results = batch()
    327
    328     def get(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in __call__(self)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in <listcomp>(.0)
    129
    130     def __call__(self):
--> 131         return [func(*args, **kwargs) for func, args, kwargs in self.items]
    132
    133     def __len__(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score)
    236             estimator.fit(X_train, **fit_params)
    237         else:
--> 238             estimator.fit(X_train, y_train, **fit_params)
    239
    240     except Exception as e:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    266             This estimator
    267         """
--> 268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
    270             self._final_estimator.fit(Xt, y, **fit_params)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    232                 pass
    233             elif hasattr(transform, "fit_transform"):
--> 234                 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
    235             else:
    236                 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    495         else:
    496             # fit method of arity 2 (supervised transformation)
--> 497             return self.fit(X, y, **fit_params).transform(X)
    498
    499

 in fit(self, X, y)
     16         self.ohe.fit(X_full)
     17         # Create a DataFrame that contains no null values, with the categorical variables one-hot encoded, for every row
---> 18         X_ohe_full = self.ohe.transform(X_full[~X[self.col].isnull()].drop(self.col, axis=1))
     19
     20         # Fit the classifier on the rows where col is null

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060
   2061     def _getitem_column(self, key):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067
   2068         # duplicate columns & possible reduce dimensionality

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3550                         loc = indexer.item()
   3551                     else:
-> 3552                         raise ValueError("cannot label index with a null key")
   3553
   3554         return self.iget(loc, fastpath=fastpath)

ValueError: cannot label index with a null key

Tags: python, pandas, scikit-learn, pipeline

Solution


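Without more information, I believe this happens because your X_train and y_train variables are pandas DataFrames, which the core scikit-learn library does not always work well with: for example, a classifier's .fit method expects an array-like object.

By passing in pandas DataFrames, you end up inadvertently indexing them as if they were numpy arrays, which is not as reliable in pandas.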
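To illustrate the indexing difference mentioned above, here is a minimal sketch. The frame and its values are hypothetical stand-ins for X_train, not the asker's data. It shows that the same expression, obj[0], means "first row" on a numpy array but "column labelled 0" on a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame standing in for X_train.
df = pd.DataFrame({"age": [39, 50, 38], "hours_per_week": [40, 13, 40]})
arr = df.to_numpy()

# On an ndarray, arr[0] is the first row.
first_row = arr[0]

# On a DataFrame, df[0] looks for a column labelled 0, which does not exist here.
try:
    df[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Positional row access on a DataFrame goes through .iloc instead.
first_row_via_iloc = df.iloc[0].to_numpy()
```

Code that was written with array-style indexing in mind can therefore behave differently, or fail, when it receives a DataFrame.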

Try converting your training data to numpy arrays:

X_train_arr = X_train.to_numpy()
y_train_arr = y_train.to_numpy()
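As a rough sketch of how the converted arrays would then be passed to the grid search, the example below uses synthetic data and a simplified pipeline. StandardScaler, the column names a, b, c and the generated labels are assumptions made for illustration; they replace the asker's custom steps, whose source is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (not the asker's dataset).
rng = np.random.RandomState(0)
X_train = pd.DataFrame(rng.normal(size=(120, 3)), columns=["a", "b", "c"])
y_train = pd.Series((X_train["a"] + X_train["b"] > 0).astype(int))

# Convert both the features and the labels to numpy arrays.
X_train_arr = X_train.to_numpy()
y_train_arr = y_train.to_numpy()

# Simplified pipeline built only from standard scikit-learn steps.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
hyperparameters = {"clf__C": [1.0, 2.0]}

search = GridSearchCV(pipeline, hyperparameters, cv=3, scoring="roc_auc", refit=False)
search.fit(X_train_arr, y_train_arr)

best_params = search.best_params_
```

This only helps if every step in the pipeline accepts plain arrays. The failing line in the traceback, X[self.col], selects a column by label, so custom steps written that way would need either the DataFrame or an integer column position instead.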
