Why do different encoders produce the same results?

Problem description

I am working with the California Housing Prices dataset, and this is what I did:

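import category_encoders as ce
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

housing = pd.read_csv("housing.csv")

X = housing.drop(["longitude", "latitude", "median_house_value"], axis=1)
y = housing["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Assumption: the original snippet never defined `preprocessor`. Columns are
# split by dtype here so that each transformer only sees its own kind of feature.
numeric_features = X.select_dtypes(include="number").columns
categorical_features = X.select_dtypes(exclude="number").columns

encoder_list = [ce.WOEEncoder(), ce.OneHotEncoder()]

for encoder in encoder_list:

    numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]
    )

    categorical_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="constant")),
            ("encoder", encoder),
        ]
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ("numeric", numeric_transformer, numeric_features),
            ("categorical", categorical_transformer, categorical_features),
        ]
    )

    pipe = Pipeline(
        steps=[("preprocessor", preprocessor), ("regressor", LinearRegression())]
    )

    pipe.fit(X_train, y_train)

    print(encoder)
    # score() runs predict() internally and returns R^2 for a regressor.
    print(pipe.score(X_test, y_test))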

Why does this produce two similar results? Shouldn't they be different? The same thing happens when I try different scalers.

Tags: python, pandas, scikit-learn, linear-regression, pipeline

Solution
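
One possible explanation, sketched on made-up data rather than on the housing set: the encoder only touches the categorical columns, and for a linear model a single categorical column carries much the same information however it is encoded. The toy example below assumes one categorical column and uses a plain per-category target mean as a stand-in for a target-based encoder (it is not WOEEncoder itself). The category names and values are invented for illustration. It suggests that a one-hot fit and a target-mean fit can end up with the same predictions, and therefore the same R^2.

```python
from collections import defaultdict
from statistics import mean

# Made-up (category, target) pairs for illustration; not taken from housing.csv.
rows = [
    ("INLAND", 100.0), ("INLAND", 140.0), ("INLAND", 120.0),
    ("NEAR BAY", 260.0), ("NEAR BAY", 300.0),
    ("ISLAND", 410.0), ("ISLAND", 390.0),
]
cats = [c for c, _ in rows]
y = [v for _, v in rows]


def r2(y_true, y_pred):
    """Coefficient of determination (the R^2 that a regressor's score() reports)."""
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# One-hot route: with one indicator column per category (no separate intercept,
# since the indicators already sum to one), the normal equations are diagonal,
# so each coefficient is sum(y) / count within that category.
sums, counts = defaultdict(float), defaultdict(int)
for c, v in rows:
    sums[c] += v
    counts[c] += 1
onehot_coef = {c: sums[c] / counts[c] for c in sums}
pred_onehot = [onehot_coef[c] for c in cats]

# Target-mean route: replace each category by one number (its mean target),
# then fit y = intercept + slope * x by ordinary least squares.
groups = defaultdict(list)
for c, v in rows:
    groups[c].append(v)
cat_mean = {c: mean(vs) for c, vs in groups.items()}
x = [cat_mean[c] for c in cats]
x_bar, y_bar = mean(x), mean(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar
pred_target = [intercept + slope * xi for xi in x]

r2_onehot = r2(y, pred_onehot)
r2_target = r2(y, pred_target)
print(r2_onehot, r2_target)
```

In the full pipeline the numeric columns are present as well, so the two fits need not match exactly, but the encoder can only change how the categorical part enters the model. Printing `categorical_features` (or the equivalent column list) is a quick way to check how many columns the encoder actually sees.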

