Python Tutorial: Preparation & Basic Regression

5.1 Pre-process a data set using principal component analysis.

# Notice we are using a new data set that needs to be read into the
# environment
import numpy as np
import pandas as pd
iris = pd.read_csv('/Users/iris.csv')
features = iris.drop(["Target"], axis = 1)
from sklearn import preprocessing
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
features_scaled = preprocessing.scale(features.to_numpy())
from sklearn.decomposition import PCA
pca = PCA(n_components = 4)
pca = pca.fit(features_scaled)
# Each column of the printed matrix holds the loadings of one component
print(np.transpose(pca.components_))
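
The loadings alone do not show how much each component matters. The following is a minimal sketch of inspecting the share of variance explained by each component. It runs on synthetic data as a stand-in for the iris features, so the names demo_features, demo_scaled, and demo_pca are assumptions and not part of the original tutorial.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.decomposition import PCA

# Synthetic stand-in for the four iris features (assumption: 150 rows, 4 columns)
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(150, 4))
# Make the second column strongly correlated with the first
demo_features[:, 1] = 2 * demo_features[:, 0] + rng.normal(scale=0.1, size=150)

# Standardize, then fit PCA exactly as in the tutorial code
demo_scaled = preprocessing.scale(demo_features)
demo_pca = PCA(n_components=4).fit(demo_scaled)

# Proportion of variance explained by each component, largest first
explained = demo_pca.explained_variance_ratio_
print(explained)
# Cumulative proportion, useful when deciding how many components to keep
print(np.cumsum(explained))
```

Because all four components are kept, the proportions sum to 1. Keeping fewer components would retain only the leading share of the variance.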

5.2 Split data into training and testing data and export as a .csv file.

from sklearn.model_selection import train_test_split
target = iris["Target"]
# The following code splits the iris data set into 70% train and 30% test
X_train, X_test, Y_train, Y_test = train_test_split(features, target,
                                                    test_size = 0.3,
                                                    random_state = 29)
train_x = pd.DataFrame(X_train)
train_y = pd.DataFrame(Y_train)
test_x = pd.DataFrame(X_test)
test_y = pd.DataFrame(Y_test)
train = pd.concat([train_x, train_y], axis = 1)
test = pd.concat([test_x, test_y], axis = 1)
train.to_csv('/Users/iris_train_Python.csv', index = False)
test.to_csv('/Users/iris_test_Python.csv', index = False)
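
Before exporting, it can help to confirm that the split has the expected sizes and that features and targets are still paired row by row. The following is a rough sketch on a hypothetical stand-in data set. The demo DataFrame and its column names f1 to f4 are assumptions, chosen only to mirror the shape of the iris data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the iris data: 150 rows, 4 features, 3 classes
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(150, 4)), columns=["f1", "f2", "f3", "f4"])
demo["Target"] = np.repeat([0, 1, 2], 50)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo.drop(["Target"], axis=1), demo["Target"],
    test_size=0.3, random_state=29)

# pd.concat aligns on the row index, so features and targets stay paired
demo_train = pd.concat([demo_X_train, demo_y_train], axis=1)
demo_test = pd.concat([demo_X_test, demo_y_test], axis=1)

# Row and column counts of each piece
print(demo_train.shape, demo_test.shape)
# True when no row appears in both the training and the testing set
print(demo_train.index.intersection(demo_test.index).empty)
```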

5.3 Fit a logistic regression model.

# Notice we are using a new data set that needs to be read into the
# environment
tips = pd.read_csv('/Users/tips.csv')
# The following code is used to determine if the individual left more
# than a 15% tip
tips["fifteen"] = 0.15 * tips["total_bill"]
tips["greater15"] = np.where(tips["tip"] > tips["fifteen"], 1, 0)
import statsmodels.api as sm
# Notice the syntax of greater15 as a function of total_bill
logreg = sm.formula.glm("greater15 ~ total_bill",
                        family=sm.families.Binomial(),
                        data=tips).fit()
print(logreg.summary())

# A logistic regression model can also be fit using sklearn; however,
# statsmodels provides a helpful summary of the model, so it is preferable
# for this example.
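
For comparison, here is a minimal sketch of the sklearn route, fit on synthetic data rather than the tips file. The demo_tips DataFrame and the way its values are generated are assumptions for illustration only. Note that sklearn's LogisticRegression applies L2 regularization by default, so its coefficients will generally not match the statsmodels GLM estimates exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the tips data (assumption: bills of 5 to 50,
# tips between 8% and 25% of the bill)
rng = np.random.default_rng(29)
demo_tips = pd.DataFrame({"total_bill": rng.uniform(5, 50, size=200)})
demo_tips["tip"] = demo_tips["total_bill"] * rng.uniform(0.08, 0.25, size=200)
demo_tips["greater15"] = np.where(
    demo_tips["tip"] > 0.15 * demo_tips["total_bill"], 1, 0)

# Double brackets keep total_bill as a 2D, single-column feature table
sk_logreg = LogisticRegression()
sk_logreg.fit(demo_tips[["total_bill"]], demo_tips["greater15"])

# sklearn exposes the fitted parameters but no summary table
print(sk_logreg.coef_, sk_logreg.intercept_)
# Predicted probability that each tip exceeds 15%
probs = sk_logreg.predict_proba(demo_tips[["total_bill"]])[:, 1]
print(probs[:5])
```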

5.4 Fit a linear regression model.

# Fit a linear regression model of tip by total_bill
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# sklearn expects a 2D feature array, so a single feature must be reshaped
# from a 1D array into one column with reshape(-1, 1)
linreg.fit(tips["total_bill"].values.reshape(-1, 1), tips["tip"])
print(linreg.coef_)
print(linreg.intercept_)
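
Once fitted, the model can be used to predict and to report a goodness-of-fit measure. The sketch below uses noise-free synthetic data in which the tip is exactly 1 plus 10% of the bill. The arrays demo_bill and demo_tip are assumptions, not values from the tips file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free stand-in: tip = 1 + 0.1 * total_bill (assumption)
demo_bill = np.linspace(5, 50, 40)
demo_tip = 1.0 + 0.1 * demo_bill

demo_linreg = LinearRegression()
demo_linreg.fit(demo_bill.reshape(-1, 1), demo_tip)

# Predict the tip for a new bill of 20; new inputs must also be 2D
predicted = demo_linreg.predict(np.array([[20.0]]))
print(predicted)
# score() returns R squared, the share of variance in tip explained by the model
r_squared = demo_linreg.score(demo_bill.reshape(-1, 1), demo_tip)
print(r_squared)
```

With real data the relationship is noisy, so R squared will fall below 1 and the fitted slope and intercept are estimates.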
