[Python] Decision Tree Algorithm (DecisionTreeClassifier): Northeastern University Data Mining Lab 3

DreamingFishZIHao 2019-12-26 20:26


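1. Train a decision tree on train_feature.csv and predict on test_feature.csv (practise parameter tuning), then compute the accuracy of the predictions. (The training data is heavily imbalanced, so it is rebalanced first: all positive samples are kept, and n times as many negatives are drawn from the original negative samples.) Note: accuracy = (test rows predicted as download) & (test rows that are actually downloads) / number of test rows that are actually downloads.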

import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split,GridSearchCV
import time,datetime
train_df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\train_feature.csv")
train_df.head()
ip app device os channel is_attributed day hour minute ip_count app_count device_count os_count channel_count hour_count minute_count
0 83230 3 1 13 379 0 2017-11-06 14 32 938 774123 6527713 1541988 101195 48 110457
1 17357 3 1 19 379 0 2017-11-06 14 33 677 774123 6527713 1644220 101195 48 112948
2 35810 3 1 13 379 0 2017-11-06 14 34 351 774123 6527713 1541988 101195 48 112532
3 45745 14 1 13 478 0 2017-11-06 14 34 7786 316214 6527713 1541988 11355 48 112532
4 161007 3 1 13 379 0 2017-11-06 14 35 132 774123 6527713 1541988 101195 48 115570
test_df=pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\test_feature.csv")
test_df.head()
click_id ip app device os channel is_attributed day hour minute ip_count app_count device_count os_count channel_count hour_count minute_count
0 0 19870 2 1 13 435 0 2017-11-06 23 1 99 308059 2853433 657790 42678 2308568 68675
1 1 50314 15 1 17 265 0 2017-11-06 23 1 233 307505 2853433 153419 68057 2308568 68675
2 2 183513 15 1 13 153 0 2017-11-06 23 1 105 307505 2853433 657790 104935 2308568 68675
3 3 35731 12 1 19 178 0 2017-11-06 23 1 550 348786 2853433 765928 89744 2308568 68675
4 4 186444 12 1 3 265 0 2017-11-06 23 1 16 348786 2853433 45955 68057 2308568 68675

(The training data is imbalanced, so rebalance it: keep the positive samples and draw n times as many negatives from the original negative samples.)

train_df["is_attributed"].value_counts()
0    6986725
1      13275
Name: is_attributed, dtype: int64
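tmp_is1 = train_df[train_df['is_attributed']==1]  # 13275 positive rows
tmp_is0 = train_df[train_df['is_attributed']==0]  # 6986725 negative rows
tmp_is0 = tmp_is0.sample(n=tmp_is1.shape[0]*5)  # n = 5: five negatives per positive
train_df = pd.concat([tmp_is1, tmp_is0])  # merge; DataFrame.append was removed in pandas 2.0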
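
The rebalancing step above can be wrapped in a small helper so that the ratio n and the random seed are explicit. This is a minimal sketch, not the author's code: the function name `downsample_negatives`, its `ratio` and `seed` parameters, and the toy frame are all hypothetical and only illustrate the idea.

```python
import pandas as pd


def downsample_negatives(df, label="is_attributed", ratio=5, seed=0):
    """Keep every positive row and sample `ratio` times as many negatives."""
    positives = df[df[label] == 1]
    negatives = df[df[label] == 0]
    # Never ask for more negatives than actually exist.
    n_neg = min(len(negatives), len(positives) * ratio)
    sampled = negatives.sample(n=n_neg, random_state=seed)
    # Concatenate, then shuffle so the classes are not stored in two blocks.
    return pd.concat([positives, sampled]).sample(frac=1, random_state=seed)


# Toy data (hypothetical): 3 positives and 97 negatives.
toy = pd.DataFrame({"ip": range(100), "is_attributed": [1] * 3 + [0] * 97})
balanced = downsample_negatives(toy, ratio=5)
print(balanced["is_attributed"].value_counts())
```

With `ratio=5` the toy frame keeps its 3 positives and 15 of the negatives. Fixing `seed` makes the sample reproducible, which the original `sample(n=...)` call does not guarantee between runs.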

Drop the 'day' column

print(train_df["day"].value_counts())  # every training row falls on the same day, so the column is uninformative; drop it
print(test_df["day"].value_counts()) 
2017-11-06    79650
Name: day, dtype: int64
2017-11-06    2308568
2017-11-07     691432
Name: day, dtype: int64
train_df1=train_df.drop(['day'],axis=1)
test_df1=test_df.drop(['day'],axis=1)
test_df1=test_df1.drop(['click_id'],axis=1)
# Separate the label from the features; every remaining column is used as a feature.
# The labels are taken as 1-D arrays, since a column vector makes fit() emit a DataConversionWarning.
y_train=train_df1['is_attributed'].values
y_test=test_df1['is_attributed'].values
x_train=train_df1.drop(['is_attributed'],axis=1)
x_test=test_df1.drop(['is_attributed'],axis=1)
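# Initialise the model. max_depth would cap the depth of the tree; it is left at its default (None) here.
# criterion can be "gini" (Gini impurity) or "entropy" (information gain).
# The default "gini", the criterion used by CART, is usually fine, unless you prefer the
# information-gain style of feature selection used by ID3 and C4.5.
clf = DecisionTreeClassifier()
# Train the model
clf.fit(x_train, y_train)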
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
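
To make the two criteria concrete, the sketch below computes the Gini impurity and the entropy of a node whose class mix matches the rebalanced training set (one positive for every five negatives). The helper functions `gini` and `entropy` are written here for illustration and are not scikit-learn APIs.

```python
import math


def gini(probs):
    """Gini impurity: 1 - sum(p_k^2)."""
    return 1.0 - sum(p * p for p in probs)


def entropy(probs):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)), skipping empty classes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Class mix after rebalancing: 5 negatives for every positive.
root = [5 / 6, 1 / 6]
print(f"gini    = {gini(root):.4f}")
print(f"entropy = {entropy(root):.4f}")
```

Both measures are zero for a pure node and largest for an even class mix, which is why the two criteria often pick similar splits.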
print("Training score:", (clf.score(x_train, y_train)))
print("Test score:", (clf.score(x_test, y_test)))
Training score: 0.9996359070935342
Test score: 0.8894056666666666
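
A training score close to 1 next to a clearly lower test score suggests that the unconstrained tree is overfitting. One way to practise the tuning the exercise asks for is a grid search with `GridSearchCV`, which the script imports but never uses. The sketch below is an assumption-laden illustration: it runs on synthetic data from `make_classification` rather than the click data, and the grid values are arbitrary starting points, not tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the click data: roughly 1 positive per 5 negatives.
X, y = make_classification(
    n_samples=600, n_features=8, weights=[5 / 6, 1 / 6], random_state=0
)

# Illustrative grid; these values are assumptions, not tuned for the real data.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",  # recall of the positive class, as in the exercise's metric
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real data, `search.fit(x_train, y_train)` would take the place of the synthetic `X, y`, and `search.best_estimator_` could then be scored on the test set in the same way as `clf`.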
predict=clf.predict(x_test)  # predict on the test set
submission = pd.DataFrame ( {
    'click_id':test_df['click_id'],
    'is_attributed':predict
} )
# submission.to_csv('submission.csv',index=False)

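Note: accuracy = (test rows predicted as download) & (test rows that are actually downloads) / number of test rows that are actually downloads. In standard terminology this ratio is the recall of the download class.

print("Accuracy:",sum((predict == 1) & (test_df.is_attributed==1)) / sum(test_df.is_attributed==1))
Accuracy: 0.7458654906284454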
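
The ratio computed above (true positives divided by actual positives) is what scikit-learn calls recall, so `recall_score` should return the same number. The sketch below checks this on a small hypothetical label vector rather than on the real predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Toy labels (hypothetical): 4 actual downloads, 3 of them predicted correctly.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# The exercise's formula: true positives / actual positives.
manual = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

# scikit-learn's recall for the positive class computes the same ratio.
recall = recall_score(y_true, y_pred)

# Rows of the confusion matrix are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(manual, recall, tp / (tp + fn))
```

On the real data the equivalent call would be `recall_score(test_df.is_attributed, predict)`. The confusion matrix also exposes the false positives, which the exercise's metric ignores.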
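# A clumsier way to compute the same accuracy
# denominator=test_df[test_df.is_attributed==1].shape[0]  # denominator: actual downloads
# denominator
# Recode 0 as 2 and 3 so that only rows where both values are 1 compare equal
# test_df['is_attributed']=test_df['is_attributed'].replace(0,2)
# submission['is_attributed']=submission['is_attributed'].replace(0,3)
# molecule=(test_df['is_attributed']==submission['is_attributed']).sum()  # numerator: correctly predicted downloads
# molecule
# Precision=molecule/denominator
# print(Precision)  # the accuracy defined above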
