Avoiding repetition: use a loop?

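Problem description

I am learning multiple linear regression. I am using backward elimination to optimize my model, with Python as the programming language.

I keep repeating the same three lines of code to remove the columns whose significance value is above 0.05 (i.e. p > 0.05). I would like to wrap these lines in a loop or a function so that I can avoid the repetition.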

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing the dataset

dataset=pd.read_csv('50_Startups.csv')

x=dataset.iloc[:,:-1].values  # all rows, every column except the last (-1)

y=dataset.iloc[:,4].values

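from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode the categorical column at index 3; the encoded columns come first
# (OneHotEncoder's categorical_features argument was removed in scikit-learn 0.22)
columntransformer=ColumnTransformer([('encoder', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
x=np.array(columntransformer.fit_transform(x), dtype=float)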

#avoiding dummy variable trap

x=x[:,1:]

#splitting into training and test sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=1/3, random_state= 0)

#fitting the linear regression to the training set

#LinearRegression also handles multiple linear regression
from sklearn.linear_model import LinearRegression

regressor=LinearRegression()
regressor.fit(x_train, y_train)
y_pred=regressor.predict(x_test)

#building the optimal model using backward elimination

import statsmodels.api as sm  # OLS is provided by statsmodels.api (formula.api offers the formula-based ols instead)

x=np.append(arr=np.ones((50,1)).astype(int), values=x,axis=1)

##### below are the repeated lines #####

x_opt=x[:,[0,1,2,3,4,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

#column 2 has p>0.05, so I removed it and refit the model

x_opt=x[:,[0,1,3,4,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

#column 1 has p>0.05, so I removed it and refit the model

x_opt=x[:,[0,3,4,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

#column 4 has p>0.05, so I removed it and refit the model

x_opt=x[:,[0,3,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

#column 5 has p>0.05, so I removed it and refit the model

x_opt=x[:,[0,3]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()

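It is these lines near the end that I would like to rewrite so that they are not repeated.

This is the dataset I am using; its name is 50_startUps

Tags: python

Solution


This should be fairly obvious: loop over the column lists.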

for cols in [
        [0,1,2,3,4,5],
        # column 2 has p>0.05, so remove it and refit the model
        [0,1,3,4,5],
        # column 1 has p>0.05, so remove it and refit the model
        [0,3,4,5],
        # column 4 has p>0.05, so remove it and refit the model
        [0,3,5],
        # column 5 has p>0.05, so remove it and refit the model
        [0,3]]:
    x_opt=x[:,cols]
    regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
    # inside a loop the summary is not echoed automatically, so print it
    print(regressor_ols.summary())

Also note the spelling of "column" in the comments.
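
The loop above still hard-codes which column to drop at each step, so the p-values have to be read off each summary by hand. As a rough sketch, the selection itself can be automated: refit, find the largest p-value, drop that column if it exceeds the threshold, and stop otherwise. The names `backward_elimination`, `ols_pvalues`, `threshold` and `keep` are illustrative and not part of statsmodels. The sketch assumes that `x` already contains the intercept column at index 0 and that the p-values come back in the same order as the selected columns, which is how the `pvalues` attribute of a fitted OLS result is laid out.

```python
def backward_elimination(columns, pvalues_for, threshold=0.05, keep=(0,)):
    """Drop the least significant column until every remaining p-value is <= threshold.

    columns     -- column indices to start from, e.g. [0, 1, 2, 3, 4, 5]
    pvalues_for -- callable taking a list of column indices and returning one
                   p-value per column, in the same order
    keep        -- columns that are never removed (assumed here: the intercept)
    """
    cols = list(columns)
    while True:
        pvalues = pvalues_for(cols)
        # Pair each removable column with its p-value.
        candidates = [(p, c) for p, c in zip(pvalues, cols) if c not in keep]
        if not candidates:
            return cols
        worst_p, worst_col = max(candidates)
        if worst_p <= threshold:
            return cols
        cols.remove(worst_col)


def ols_pvalues(x, y, cols):
    """Fit OLS on the selected columns of x and return the coefficient p-values."""
    # Imported here so that the selection loop above has no dependencies of its own.
    import statsmodels.api as sm
    return sm.OLS(endog=y, exog=x[:, cols]).fit().pvalues
```

With the `x` and `y` from the question, a call might look like `cols = backward_elimination([0, 1, 2, 3, 4, 5], lambda c: ols_pvalues(x, y, c))`, followed by a single `sm.OLS(endog=y, exog=x[:, cols]).fit()` to inspect the final model. Passing the fitting step in as a callable keeps the loop independent of the model, so the same function works with any routine that returns one p-value per column.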

