Predicting Titanic survival with sklearn (scikit-learn) LogisticRegression (Kaggle)

morikokyuro 2018-02-23 18:03

Titanic: prediction using sklearn

After EDA, we can now preprocess the training data and fit a predictive model with the scikit-learn (sklearn) machine learning library.

With the analysis above done, we can pick a few features to use and then choose a model.

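We use scikit-learn. The framework implements the basic ML methods, is easy to use, saves us from writing code from scratch, and supports cross-validation. Unless a problem is better served by a multi-layer deep neural network, in which case we can use TensorFlow or Theano, scikit-learn is sufficient whenever traditional machine learning methods can solve the problem.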

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
## Read the train and test data, then preprocess: fill missing values, convert str to int, and standardize scales
path = './titanic/'
trainset = pd.read_csv(path + 'train.csv')
testset = pd.read_csv(path + 'test.csv')
print('[*] trainset shape is : ' + str(trainset.shape))
print('[*] testset shape is : ' + str(testset.shape))
## Fill missing values and convert data types
## On the training set:
## map() converts each string category to an integer code in one step
trainset['Sex'] = trainset['Sex'].map({'male': 0, 'female': 1})
trainset['Embarked'] = trainset['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
trainset.Age = trainset.Age.fillna(trainset.Age.median())
trainset.Sex = trainset.Sex.fillna(trainset.Sex.mode()[0])
trainset.Fare = trainset.Fare.fillna(trainset.Fare.mean())
trainset.Pclass = trainset.Pclass.fillna(trainset.Pclass.mode()[0])
trainset.Embarked = trainset.Embarked.fillna(trainset.Embarked.mode()[0])
## On the test set: (under the i.i.d. assumption, fillna uses the training-set statistics (median, mean, or mode), since the training set is larger. The statistics of the combined train and test sets could also be used.)
testset['Sex'] = testset['Sex'].map({'male': 0, 'female': 1})
testset['Embarked'] = testset['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
testset.Age = testset.Age.fillna(trainset.Age.median())
testset.Sex = testset.Sex.fillna(trainset.Sex.mode()[0])
testset.Fare = testset.Fare.fillna(trainset.Fare.mean())
testset.Pclass = testset.Pclass.fillna(trainset.Pclass.mode()[0])
testset.Embarked = testset.Embarked.fillna(trainset.Embarked.mode()[0])
## Rescale the training and test sets with StandardScaler (fit on the training set only)
AgeScaler = StandardScaler().fit(trainset[['Age']])
FareScaler = StandardScaler().fit(trainset[['Fare']])
#print(AgeScaler.mean_, AgeScaler.scale_)
#print(FareScaler.mean_, FareScaler.scale_)
trainset.Age = AgeScaler.transform(trainset[['Age']])
trainset.Fare = FareScaler.transform(trainset[['Fare']])
testset.Age = AgeScaler.transform(testset[['Age']])
testset.Fare = FareScaler.transform(testset[['Fare']])
## Select features and fit logistic regression
print('[*] Using Logistic Regression Model')
features = ['Pclass','Sex','Age','Fare','Embarked']
predlabel = ['Survived']
train_X = trainset[features]
train_Y = trainset[predlabel].values.ravel()  # flatten to the 1-D label array that fit() expects
test_X = testset[features]
LogReg = LogisticRegressionCV(random_state=0)
LogReg.fit(train_X,train_Y)
test_Y_hat = LogReg.predict(test_X)
print('[*] prediction completed')
## Take PassengerId from the test set instead of hard-coding range(892,1310)
submission = pd.DataFrame({'PassengerId': testset['PassengerId'], 'Survived': test_Y_hat})
#trainset.head(10)
#pd.read_csv(path+'gender_submission.csv')
## Save in the required format: a csv file without the index.
submission.to_csv('./titanic/logreg_submission.csv',index=False)
print('[*] result saved')
print('[*] done')
[*] trainset shape is : (891, 12)
[*] testset shape is : (418, 11)
[*] Using Logistic Regression Model
[*] prediction completed
[*] result saved
[*] done

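Above we used LogisticRegressionCV instead of the earlier LogisticRegression, which amounts to running cross-validation to tune C, the inverse of the regularization strength. This change moved the submission up 439 places on the leaderboard.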
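
To make the role of C concrete, here is a minimal sketch of how LogisticRegressionCV picks it. The synthetic data from make_classification and the candidate list `candidate_Cs` are assumptions for illustration only; they are not the Titanic data or the settings used in the script above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in for the feature matrix (assumption: 5 numeric features).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Hypothetical candidate values of C; a smaller C means stronger regularization.
candidate_Cs = [0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation scores every candidate, then the model is refit
# on all of the data with the best-scoring C.
model = LogisticRegressionCV(Cs=candidate_Cs, cv=5, random_state=0, max_iter=1000)
model.fit(X, y)

chosen_C = float(np.ravel(model.C_)[0])
print('chosen C:', chosen_C)
print('training accuracy:', model.score(X, y))
```

When Cs is left at its default, scikit-learn searches a logarithmic grid of values instead of an explicit list.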

Next, consider adding the SibSp and Parch features to see whether they help:

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
## Read the train and test data, then preprocess: fill missing values, convert str to int, and standardize scales
path = './titanic/'
trainset = pd.read_csv(path + 'train.csv')
testset = pd.read_csv(path + 'test.csv')
print('[*] trainset shape is : ' + str(trainset.shape))
print('[*] testset shape is : ' + str(testset.shape))
## Fill missing values and convert data types
## On the training set:
## map() converts each string category to an integer code in one step
trainset['Sex'] = trainset['Sex'].map({'male': 0, 'female': 1})
trainset['Embarked'] = trainset['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
trainset.Age = trainset.Age.fillna(trainset.Age.median())
trainset.Sex = trainset.Sex.fillna(trainset.Sex.mode()[0])
trainset.Fare = trainset.Fare.fillna(trainset.Fare.mean())
trainset.Pclass = trainset.Pclass.fillna(trainset.Pclass.mode()[0])
trainset.Embarked = trainset.Embarked.fillna(trainset.Embarked.mode()[0])
trainset.SibSp = trainset.SibSp.fillna(trainset.SibSp.mode()[0])
trainset.Parch = trainset.Parch.fillna(trainset.Parch.mode()[0])
## On the test set: (under the i.i.d. assumption, fillna uses the training-set statistics (median, mean, or mode), since the training set is larger. The statistics of the combined train and test sets could also be used.)
testset['Sex'] = testset['Sex'].map({'male': 0, 'female': 1})
testset['Embarked'] = testset['Embarked'].map({'S': 1, 'C': 2, 'Q': 3})
testset.Age = testset.Age.fillna(trainset.Age.median())
testset.Sex = testset.Sex.fillna(trainset.Sex.mode()[0])
testset.Fare = testset.Fare.fillna(trainset.Fare.mean())
testset.Pclass = testset.Pclass.fillna(trainset.Pclass.mode()[0])
testset.Embarked = testset.Embarked.fillna(trainset.Embarked.mode()[0])
testset.SibSp = testset.SibSp.fillna(trainset.SibSp.mode()[0])
testset.Parch = testset.Parch.fillna(trainset.Parch.mode()[0])
## Rescale the training and test sets with StandardScaler (fit on the training set only)
AgeScaler = StandardScaler().fit(trainset[['Age']])
FareScaler = StandardScaler().fit(trainset[['Fare']])
#print(AgeScaler.mean_, AgeScaler.scale_)
#print(FareScaler.mean_, FareScaler.scale_)
trainset.Age = AgeScaler.transform(trainset[['Age']])
trainset.Fare = FareScaler.transform(trainset[['Fare']])
testset.Age = AgeScaler.transform(testset[['Age']])
testset.Fare = FareScaler.transform(testset[['Fare']])
## Select features and fit logistic regression
print('[*] Using Logistic Regression Model')
features = ['Pclass','Sex','Age','Fare','Embarked','SibSp','Parch']
predlabel = ['Survived']
train_X = trainset[features]
train_Y = trainset[predlabel].values.ravel()  # flatten to the 1-D label array that fit() expects
test_X = testset[features]
LogReg = LogisticRegressionCV(random_state=0)
LogReg.fit(train_X,train_Y)
test_Y_hat = LogReg.predict(test_X)
print('[*] prediction completed')
## Take PassengerId from the test set instead of hard-coding range(892,1310)
submission = pd.DataFrame({'PassengerId': testset['PassengerId'], 'Survived': test_Y_hat})
#trainset.head(10)
#pd.read_csv(path+'gender_submission.csv')
## Save in the required format: a csv file without the index.
submission.to_csv('./titanic/logreg_submission.csv',index=False)
print('[*] result saved')
print('[*] done')
[*] trainset shape is : (891, 12)
[*] testset shape is : (418, 11)
[*] Using Logistic Regression Model
[*] prediction completed
[*] result saved
[*] done
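
Both scripts repeat the same preprocessing by hand: fill missing values with training-set statistics, then scale with a StandardScaler fit on the training set. One possible way to bundle those steps is a scikit-learn Pipeline, sketched below. This is not the code used for the submissions above. The frame is randomly generated stand-in data, the median is used for both Age and Fare (the scripts use the mean for Fare), and Sex and Embarked are one-hot encoded rather than mapped to integer codes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated stand-in with Titanic-like columns (not the real data).
rng = np.random.RandomState(0)
n = 120
frame = pd.DataFrame({
    'Pclass': rng.choice([1, 2, 3], size=n),
    'Sex': rng.choice(['male', 'female'], size=n),
    'Age': rng.uniform(1, 70, size=n),
    'Fare': rng.uniform(5, 100, size=n),
    'Embarked': rng.choice(['S', 'C', 'Q'], size=n),
})
frame.loc[frame.index[::10], 'Age'] = np.nan         # simulate missing ages
frame.loc[frame.index[5::20], 'Embarked'] = np.nan   # simulate missing ports
# Made-up label: mostly follows Sex, with about 20% of rows flipped.
label = ((frame['Sex'] == 'female') ^ (rng.rand(n) < 0.2)).astype(int)

# Numeric columns: impute with the training median, then standardize.
numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
# Categorical columns: impute with the training mode, then one-hot encode.
categorical = Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                        ('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocess = ColumnTransformer([
    ('num', numeric, ['Age', 'Fare']),
    ('cat', categorical, ['Sex', 'Embarked']),
    ('pclass', 'passthrough', ['Pclass']),
])
clf = Pipeline([('prep', preprocess),
                ('logreg', LogisticRegressionCV(cv=5, random_state=0, max_iter=1000))])

# fit() learns the imputation and scaling statistics from the training rows only;
# predict() reuses them on the held-out rows.
train, test = frame.iloc[:90], frame.iloc[90:]
clf.fit(train, label.iloc[:90])
pred = clf.predict(test)
print(pred[:10])
```

Because the statistics live inside the fitted pipeline, the training and test frames cannot drift apart the way two hand-written preprocessing blocks can.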

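Sure enough, the score improves a little. Features that looked only weakly correlated with survival in the earlier analysis can still improve classification accuracy once they are fed to logistic regression. Accuracy could be pushed further by using the title in Name and the cabin number (see the Titanic deck plans), by trying other models, or by ensembling. Since this problem is meant only as a toy problem for getting familiar with the environment and methods, I will not explore it further; on real problems it is worth spending more time on model selection, cross-validation, and ensembling to improve model performance.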
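
As an illustration of the title idea, here is a small sketch that pulls the title out of names written in the "Surname, Title. Given names" format of the Name column. The regular expression and the variable names are my own assumptions, not part of the scripts above.

```python
import pandas as pd

# Three names in the format used by the Name column.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```

The resulting column would still need to be encoded as numbers, for example with map() or one-hot encoding, before it can be passed to the model.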

2018-02-23 18:01:36
"We take risks precisely because God gave us this mortal flesh, not because we are careless with our lives." (Stephen King)

In the end I tried a random forest anyway, and the improvement was obvious.

So it pays to try several models and to tune their parameters.
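
The random forest code is not shown in this post, so the following is only a sketch of how two models could be compared with cross-validation before submitting. The data is synthetic and the hyperparameters (n_estimators=100, cv=5) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the seven preprocessed features.
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           n_redundant=0, random_state=0)

models = {
    'logreg': LogisticRegression(max_iter=1000),
    'forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}
for name, score in scores.items():
    print(name, round(score, 3))
```

Comparing cross-validated scores locally gives a rough ranking of models without spending Kaggle submissions on each one.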
