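1. Train a decision tree on train_feature.csv, predict on test_feature.csv (practicing parameter tuning), and compute the accuracy of the predictions. Because the training data is imbalanced, balance it first: keep all positive samples and draw n times as many negative samples from the original negatives. Note: accuracy = (test samples predicted as download AND actually download) / (test samples that are actually download). This ratio is the recall of the positive class, not overall accuracy.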
import pandas as pd
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
import time, datetime
train_df = pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\train_feature.csv")
train_df.head()
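| | ip | app | device | os | channel | is_attributed | day | hour | minute | ip_count | app_count | device_count | os_count | channel_count | hour_count | minute_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 83230 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 32 | 938 | 774123 | 6527713 | 1541988 | 101195 | 48 | 110457 |
| 1 | 17357 | 3 | 1 | 19 | 379 | 0 | 2017-11-06 | 14 | 33 | 677 | 774123 | 6527713 | 1644220 | 101195 | 48 | 112948 |
| 2 | 35810 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 34 | 351 | 774123 | 6527713 | 1541988 | 101195 | 48 | 112532 |
| 3 | 45745 | 14 | 1 | 13 | 478 | 0 | 2017-11-06 | 14 | 34 | 7786 | 316214 | 6527713 | 1541988 | 11355 | 48 | 112532 |
| 4 | 161007 | 3 | 1 | 13 | 379 | 0 | 2017-11-06 | 14 | 35 | 132 | 774123 | 6527713 | 1541988 | 101195 | 48 | 115570 |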
test_df = pd.read_csv("C:\\Users\\zzh\\Desktop\\dataMiningExperment\\数据挖掘实训课件\\数据挖掘第3次实训\\数据\\训练和预测用数据--做题用\\test_feature.csv")
test_df.head()
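| | click_id | ip | app | device | os | channel | is_attributed | day | hour | minute | ip_count | app_count | device_count | os_count | channel_count | hour_count | minute_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19870 | 2 | 1 | 13 | 435 | 0 | 2017-11-06 | 23 | 1 | 99 | 308059 | 2853433 | 657790 | 42678 | 2308568 | 68675 |
| 1 | 1 | 50314 | 15 | 1 | 17 | 265 | 0 | 2017-11-06 | 23 | 1 | 233 | 307505 | 2853433 | 153419 | 68057 | 2308568 | 68675 |
| 2 | 2 | 183513 | 15 | 1 | 13 | 153 | 0 | 2017-11-06 | 23 | 1 | 105 | 307505 | 2853433 | 657790 | 104935 | 2308568 | 68675 |
| 3 | 3 | 35731 | 12 | 1 | 19 | 178 | 0 | 2017-11-06 | 23 | 1 | 550 | 348786 | 2853433 | 765928 | 89744 | 2308568 | 68675 |
| 4 | 4 | 186444 | 12 | 1 | 3 | 265 | 0 | 2017-11-06 | 23 | 1 | 16 | 348786 | 2853433 | 45955 | 68057 | 2308568 | 68675 |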
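Balancing the training data: the class distribution is heavily skewed, so keep every positive sample and draw n times as many negative samples from the original negatives (n = 5 in the code that follows).
train_df["is_attributed"].value_counts()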
0 6986725
1 13275
Name: is_attributed, dtype: int64
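tmp_is1 = train_df[train_df['is_attributed'] == 1]
tmp_is0 = train_df[train_df['is_attributed'] == 0]
# Sample 5x as many negatives as positives (pass random_state to make this reproducible)
tmp_is0 = tmp_is0.sample(n=tmp_is1.shape[0] * 5)
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
train_df = pd.concat([tmp_is1, tmp_is0])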
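The balancing step can also be wrapped in a small helper. The following is a minimal sketch, not the author's code: the function name `undersample_negatives`, its `ratio` and `seed` parameters, and the toy DataFrame are all assumptions made for illustration.

```python
import pandas as pd


def undersample_negatives(df, label_col="is_attributed", ratio=5, seed=42):
    """Keep every positive row and sample ratio * n_pos negative rows.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Never ask for more negative rows than actually exist
    n_neg = min(len(neg), len(pos) * ratio)
    neg_sample = neg.sample(n=n_neg, random_state=seed)
    # Concatenate, then shuffle so the two classes are interleaved
    combined = pd.concat([pos, neg_sample])
    return combined.sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy data (made up): 10 positives and 990 negatives
toy = pd.DataFrame({
    "feature": range(1000),
    "is_attributed": [1] * 10 + [0] * 990,
})
balanced = undersample_negatives(toy, ratio=5)
print(balanced["is_attributed"].value_counts())
```

Fixing the seed makes the sampled negatives, and therefore the scores reported later, repeatable between runs.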
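Drop the 'day' column. The balanced training set contains a single date, so the column carries no information for the model.
print(train_df["day"].value_counts())
print(test_df["day"].value_counts())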
2017-11-06 79650
Name: day, dtype: int64
2017-11-06 2308568
2017-11-07 691432
Name: day, dtype: int64
train_df1 = train_df.drop(['day'], axis=1)
test_df1 = test_df.drop(['day'], axis=1)
test_df1 = test_df1.drop(['click_id'], axis=1)
# Select the label as a 1-D array; a (n, 1) column triggers a DataConversionWarning in fit()
y_train = train_df1['is_attributed'].values
y_test = test_df1['is_attributed'].values
x_train = train_df1.drop(['is_attributed'], axis=1)
x_test = test_df1.drop(['is_attributed'], axis=1)
clf = DecisionTreeClassifier()
clf.fit(x_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
print("Training score:", clf.score(x_train, y_train))
print("Test score:", clf.score(x_test, y_test))
Training score: 0.9996359070935342
Test score: 0.8894056666666666
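The gap between the training score (about 0.9996) and the test score (about 0.889) suggests that the unconstrained tree overfits, which is where the tuning practice named in the exercise comes in. Below is a minimal sketch of a grid search with `GridSearchCV`. It runs on synthetic data from `make_classification` rather than the real CSV files, and the grid values are illustrative assumptions, not settings tuned for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the balanced training set (about 5 negatives per positive)
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[5 / 6], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative grid: limiting depth and leaf size are common ways to curb overfitting
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 10, 50],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="recall",  # the exercise's metric is recall on the positive class
    cv=3,
)
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("cross-validated recall:", search.best_score_)
print("held-out recall:", search.score(X_te, y_te))
```

On the real data, `x_train` and `y_train` would take the place of `X_tr` and `y_tr`. Optimizing recall alone can favor trees that predict "download" too readily, so it is worth checking precision as well.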
predict = clf.predict(x_test)
submission = pd.DataFrame({
    'click_id': test_df['click_id'],
    'is_attributed': predict
})
Note: accuracy = (predicted as download AND actually download in the test set) / (number of actual downloads in the test set)
print("Accuracy:", ((predict == 1) & (test_df.is_attributed == 1)).sum() / (test_df.is_attributed == 1).sum())
Accuracy: 0.7458654906284454
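The ratio computed here is what scikit-learn calls recall. As a small sanity check, the sketch below compares the manual formula with `recall_score` on a handful of made-up labels. The arrays are assumptions for illustration only and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# The exercise's formula: true positives / actual positives
manual = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("manual formula:", manual)
print("recall_score:  ", recall_score(y_true, y_pred))
print("precision:     ", tp / (tp + fp))
```

Reporting precision next to recall shows how many of the predicted downloads are real, which the exercise's metric on its own does not capture.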
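Hi everyone, I'm [爱做梦的子浩](https://blog.csdn.net/weixin_43124279), a third-year student in the big data experimental class at Northeastern University (东北大学) and still very much a beginner. I aspire to excellence and admire people who excel. I have received two summer offers so far, and you are welcome to get in touch.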