python - Avoiding repetition with a loop?
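Question
I am learning multiple linear regression and using backward elimination to optimize my model, with Python as the programming language.
I repeat the same three lines of code to remove the columns whose significance value is above 0.05 (that is, p > 0.05). I would like to put those lines in a loop or a function so that I can avoid the repetition.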
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
#importing the dataset
dataset=pd.read_csv('50_Startups.csv')
x=dataset.iloc[:,:-1].values #taking all the rows and every column except the last one (-1)
y=dataset.iloc[:,4].values
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
lableencoder_x=LabelEncoder()
x[:, 3]=lableencoder_x.fit_transform(x[:, 3])
onehotencoder=OneHotEncoder(categorical_features=[3])
x=onehotencoder.fit_transform(x).toarray()
#avoiding dummy variable trap
x=x[:,1:]
#splitting into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x,y,test_size=1/3, random_state= 0)
#fitting the linear regression to the training set
#linearregression package can do multiple linear regression
from sklearn.linear_model import LinearRegression
regressor=LinearRegression()
regressor.fit(x_train, y_train)
y_pred=regressor.predict(x_test)
#building the optimal model using backward elimination
import statsmodels.api as sm #OLS is exposed by statsmodels.api
x=np.append(arr=np.ones((50,1)).astype(int), values=x,axis=1)
##### below are the repeated lines #####
x_opt=x[:,[0,1,2,3,4,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()
#column 2 has the p>0.05 so I removed and again optimizing the model
x_opt=x[:,[0,1,3,4,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()
#column 1 has the p>0.05 so I removed and again optimizing the model
x_opt=x[:,[0,3,4,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()
#column 4 has the p>0.05 so I removed and again optimizing the model
x_opt=x[:,[0,3,5]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()
#column 5 has the p>0.05 so I removed and again optimizing the model
x_opt=x[:,[0,3]]
regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
regressor_ols.summary()
I would like to apply something to the lines near the end to avoid the repetition.
Solution
This should be fairly straightforward: put the column selections in a list and loop over it.
for cols in [
        [0,1,2,3,4,5],
        # column 2 has the p>0.05 so I removed and again optimizing the model
        [0,1,3,4,5],
        # column 1 has the p>0.05 so I removed and again optimizing the model
        [0,3,4,5],
        # column 4 has the p>0.05 so I removed and again optimizing the model
        [0,3,5],
        # column 5 has the p>0.05 so I removed and again optimizing the model
        [0,3]]:
    x_opt=x[:,cols]
    regressor_ols=sm.OLS(endog=y,exog=x_opt).fit()
    # print explicitly: a bare expression inside a loop displays nothing
    print(regressor_ols.summary())
Also note the spelling of "column" in the comments.
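The loop above still relies on column lists worked out by hand from each summary. If the goal is for the code to pick the column to drop, a rough sketch might look like the following. It is not part of the original answer. The names `ols_pvalues` and `backward_eliminate`, the `pvalues_fn` hook, and the choice to always keep column 0 (the column of ones) are assumptions made for illustration. The only statsmodels calls used are `sm.OLS(...).fit()` and the `pvalues` attribute of the fitted result.

```python
def ols_pvalues(x, y, cols):
    """Fit OLS on the selected columns and return one p-value per column."""
    # Imported lazily so backward_eliminate can also be used with another fitter.
    import statsmodels.api as sm
    return list(sm.OLS(endog=y, exog=x[:, cols]).fit().pvalues)


def backward_eliminate(x, y, cols, threshold=0.05,
                       pvalues_fn=ols_pvalues, keep=(0,)):
    """Drop the column with the highest p-value until none exceeds threshold.

    Assumption: columns listed in `keep` (by default column 0, the
    intercept) are never removed.
    """
    cols = list(cols)
    while True:
        pvalues = pvalues_fn(x, y, cols)
        # Pair each p-value with its column index, skipping protected columns.
        candidates = [(p, c) for p, c in zip(pvalues, cols) if c not in keep]
        if not candidates:
            return cols
        worst_p, worst_col = max(candidates)
        if worst_p <= threshold:
            return cols
        cols.remove(worst_col)
```

With the `x` and `y` arrays from the question, `backward_eliminate(x, y, [0,1,2,3,4,5])` would return the list of surviving column indices, which can then be used as `x_opt=x[:,cols]` for a final fit. Removing one column per iteration and refitting mirrors the manual steps in the question, since p-values change each time a column is dropped.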