Creating baseline regression models with the mean and minimum in Python

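Question

I want to compare the results of my regression analysis, which uses encoded categorical variables, against two baseline models in which the baseline prediction is the group mean or the group minimum. I chose R-squared and MAE for the comparison. Below is a simplified example of my code for illustration. It runs and gives me output that I think achieves my goal. Is this a correct and/or the best way to do it?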

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split 
from sklearn import metrics

df = pd.DataFrame([['a1','c1',10],
                   ['a1','c2',15],
                   ['a1','c3',20],
                   ['a1','c1',15],
                   ['a2','c2',20],
                   ['a2','c3',15],
                   ['a2','c1',20],
                   ['a2','c2',15],
                   ['a3','c3',20],
                   ['a3','c3',15],
                   ['a3','c3',15],
                   ['a3','c3',20]], columns=['aid','cid','T'])

df_dummies = pd.get_dummies(df, columns=['aid','cid'],prefix_sep='',prefix='')
df_dummies

X = df_dummies
y = df_dummies['T']

# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regr = LinearRegression()
regr.fit(X_train, y_train)

y_pred = regr.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))

# Baseline model with group average as prediction
y_pred = df.groupby('aid').agg({'T': ['mean']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))

# Baseline model with group min as prediction
y_pred = df.groupby('aid').agg({'T': ['min']})
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))

Tags: python-3.x, scikit-learn, regression, pandas-groupby, linear-regression

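Answer

First, I would stop reusing the name y_pred for every model and give each set of predictions its own name, to avoid confusion.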

In general:

y_pred = df.groupby('aid').agg({'T': ['mean']})

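gives you the mean of T for each value of the aid column. That is one row per group, not one prediction per test row, so it matches y_test in length only because there happen to be three groups and three test rows.

Likewise, y_pred = df.groupby('aid').agg({'T': ['min']}) gives you the minimum of T for each group.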
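To make the group statistics usable as predictions, each test row needs the value of its own group. The following is a minimal sketch of that idea, not the asker's method: it assumes the statistics should be learned from the training rows only and that a group missing from the training rows falls back to the overall training value. The names train_df, test_df, group_stats, pred_group_mean and pred_group_min are illustrative.

```python
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split

df = pd.DataFrame([['a1', 'c1', 10], ['a1', 'c2', 15], ['a1', 'c3', 20],
                   ['a1', 'c1', 15], ['a2', 'c2', 20], ['a2', 'c3', 15],
                   ['a2', 'c1', 20], ['a2', 'c2', 15], ['a3', 'c3', 20],
                   ['a3', 'c3', 15], ['a3', 'c3', 15], ['a3', 'c3', 20]],
                  columns=['aid', 'cid', 'T'])

# Split whole rows so the group label stays attached to each target value
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Per-group mean and min of T, computed on the training rows only
group_stats = train_df.groupby('aid')['T'].agg(['mean', 'min'])

# Assumption: groups unseen in training fall back to the overall training value
overall_mean = train_df['T'].mean()
overall_min = train_df['T'].min()

# One prediction per test row, looked up by that row's group
pred_group_mean = test_df['aid'].map(group_stats['mean']).fillna(overall_mean)
pred_group_min = test_df['aid'].map(group_stats['min']).fillna(overall_min)

for name, pred in [('group mean', pred_group_mean), ('group min', pred_group_min)]:
    print(name, 'R-squared:', metrics.r2_score(test_df['T'], pred))
    print(name, 'MAE:', metrics.mean_absolute_error(test_df['T'], pred))
```

With only three test rows the scores are very noisy, so on real data a larger test set or cross-validation would give a fairer comparison.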

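scikit-learn has a useful class for this, DummyRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyRegressor.html

It fits baseline ("dummy") regressors that ignore the features, and it supports other strategies besides a constant: mean, median and quantile.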
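As a small illustration of those strategies, here is a sketch on made-up numbers (the arrays X and y below are toy values, not the data from the question). It also shows one way to get a minimum baseline without computing the constant yourself, by asking for the 0.0 quantile.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy data: the features are ignored by DummyRegressor, only y matters
X = np.zeros((4, 1))
y = np.array([10.0, 15.0, 20.0, 35.0])

baselines = {
    'mean': DummyRegressor(strategy='mean'),
    'median': DummyRegressor(strategy='median'),
    # The 0.0 quantile of the training targets is their minimum
    'min': DummyRegressor(strategy='quantile', quantile=0.0),
    'constant': DummyRegressor(strategy='constant', constant=12.0),
}

# Every baseline predicts the same value for any input row
predictions = {name: model.fit(X, y).predict(X[:1])[0]
               for name, model in baselines.items()}

for name, value in predictions.items():
    print(name, value)
```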

In your case, it should work like this:

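from sklearn.dummy import DummyRegressor

df_dummies = pd.get_dummies(df, columns=['aid','cid'], prefix_sep='', prefix='')
X = df_dummies.drop(columns='T')  # keep the target out of the features
y = df['T']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumption: min_value is the smallest target seen in the training data
min_value = float(y_train.min())
dummy_min = DummyRegressor(strategy='constant', constant=min_value)
dummy_min.fit(X_train, y_train)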
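To finish the comparison, each fitted model can be scored on the same test split. The sketch below is one way to do that under a few assumptions: the target T is dropped from the features, the minimum is taken from the training targets, and the results dictionary and model labels are illustrative names.

```python
import pandas as pd
from sklearn import metrics
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame([['a1', 'c1', 10], ['a1', 'c2', 15], ['a1', 'c3', 20],
                   ['a1', 'c1', 15], ['a2', 'c2', 20], ['a2', 'c3', 15],
                   ['a2', 'c1', 20], ['a2', 'c2', 15], ['a3', 'c3', 20],
                   ['a3', 'c3', 15], ['a3', 'c3', 15], ['a3', 'c3', 20]],
                  columns=['aid', 'cid', 'T'])

# One-hot encode the categorical columns and keep T out of the features
X = pd.get_dummies(df[['aid', 'cid']], prefix_sep='', prefix='')
y = df['T']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    'linear regression': LinearRegression(),
    'dummy mean': DummyRegressor(strategy='mean'),
    'dummy min': DummyRegressor(strategy='constant',
                                constant=float(y_train.min())),
}

# Fit every model on the training split and score it on the test split,
# storing each model's predictions under its own name
results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        'r2': metrics.r2_score(y_test, pred),
        'mae': metrics.mean_absolute_error(y_test, pred),
    }

for name, scores in results.items():
    print(name, 'R-squared:', scores['r2'], 'MAE:', scores['mae'])
```

A constant prediction cannot beat the test-set mean, so the two dummy models score an R-squared of at most zero; the useful question is how far the regression sits above them.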
