Column-does-not-exist error in a scikit-learn/pandas function

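Problem description

I am trying to train this random forest classifier to check whether my preprocessing works. Judging by the error message ('price'), I think I made a mistake when separating the training data from the labels, but I can't tell exactly what went wrong.

Code: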

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier


def diamond_preprocess(data_dir):
    data = pd.read_csv(data_dir)
    cleaned_data = data.drop(['id', 'depth_percent'], axis=1)  # Features I don't want

    x = cleaned_data.drop(['price'], axis=1)  # Train data
    y = cleaned_data['price']  # Label data

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

    numerical_features = cleaned_data.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = cleaned_data.select_dtypes(include=['object']).columns

    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),  # Fill in missing data with median
        ('scaler', StandardScaler())  # Scale data
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),  # Fill in missing data with 'missing'
        ('onehot', OneHotEncoder(handle_unknown='ignore'))  # One hot encode categorical data
    ])

    preprocessor_pipeline = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    rf = Pipeline(steps=[('preprocessor', preprocessor_pipeline),
                         ('classifier', RandomForestClassifier())])

    rf.fit(x_train, y_train)

cleaned_data.columns: Index(['carat', 'cut', 'color', 'clarity', 'table', 'price', 'length', 'width', 'depth'], dtype='object')

Error:

  File "pandas\_libs\hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'price'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\sklearn\utils\__init__.py", line 396, in _get_column_indices
    col_idx = all_columns.get_loc(col)
  File "C:\Users\17574\Anaconda3\envs\kraken-gpu\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'price'

The above exception was the direct cause of the following exception:

ValueError: A given column is not a column of the dataframe

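It seems to be upset that I am feeding x_train (which excludes 'price', since it is my training data) into a preprocessing pipeline that includes the 'price' feature. That shouldn't be a problem, since my labels are just integer 'price' values, right? Or do the labels need preprocessing too, with a separate transformer?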

Tags: python, pandas, scikit-learn

Solution


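You are building the ColumnTransformer from the columns of the cleaned_data DataFrame rather than those of x_train. cleaned_data still contains 'price', so numerical_features includes 'price', and the 'num' transformer is asked to select a column that x_train does not have.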
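
The mismatch can be seen with pandas alone. The following is a minimal sketch, assuming a small made-up DataFrame that reuses only a few column names from the question (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the question.
cleaned_data = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1],
    "cut": ["Ideal", "Good", "Premium"],
    "price": [500, 2100, 5400],
})
x = cleaned_data.drop(["price"], axis=1)

# Selecting from cleaned_data picks up 'price' as a numerical feature ...
numerical_features = cleaned_data.select_dtypes(include=["int64", "float64"]).columns

# ... but x no longer has that column, so a ColumnTransformer fitted on x
# would be asked for a column that is not there.
missing = [col for col in numerical_features if col not in x.columns]
print(missing)
```

Any name that shows up in missing is a column the transformer will fail to find when fit is called on x.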

You can fix this by computing the numerical and categorical features from x_train instead:

    numerical_features = x_train.select_dtypes(include=['int64', 'float64']).columns
    categorical_features = x_train.select_dtypes(include=['object']).columns

Or, better still, use sklearn.compose.make_column_selector to do the selection:

from sklearn.compose import make_column_selector

preprocessor_pipeline = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, make_column_selector(dtype_exclude=object)),
        ('cat', categorical_transformer, make_column_selector(dtype_include=object))
    ])
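
For reference, here is a self-contained sketch of the selector-based version, run end to end. It assumes a tiny invented DataFrame in place of the diamonds CSV and skips train_test_split to stay short. The categorical column is created with dtype=object explicitly so that the object-based selectors match it regardless of the pandas version.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the diamonds CSV.
data = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1, 0.5, 0.9, 1.3],
    "cut": pd.Series(["Ideal", "Good", "Premium", "Ideal", "Good", "Premium"],
                     dtype=object),
    "price": [500, 2100, 5400, 900, 3000, 7000],
})
x = data.drop(["price"], axis=1)  # features only
y = data["price"]                 # labels

numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# The selectors are callables that are evaluated on the DataFrame passed
# to fit, so they can never name a column that is absent from x.
preprocessor_pipeline = ColumnTransformer(transformers=[
    ("num", numerical_transformer, make_column_selector(dtype_exclude=object)),
    ("cat", categorical_transformer, make_column_selector(dtype_include=object)),
])

rf = Pipeline(steps=[
    ("preprocessor", preprocessor_pipeline),
    ("classifier", RandomForestClassifier(random_state=0)),
])
rf.fit(x, y)
predictions = rf.predict(x)
print(len(predictions))
```

Because the column choice is deferred until fit, the same pipeline keeps working if columns are later dropped from or added to the feature frame.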
