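python - How do I get fitted values from a decision tree?
Problem description
I am using a decision tree to predict the first column of my input file (T or N) from the values of the remaining columns (0 and 1). My input file has the following form: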
T,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
N,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
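I want to fit my predictions and obtain a fitted value y_predfit (if y_predfit > threshold then prediction = T, otherwise prediction = N). I used the following lines of code to get y_predfit, but when I print y_predfit all I get is a set of 0s, so I am not getting the fitted values I want, and I am not sure whether I used the right lines of code. How can I achieve this and obtain the fitted values (y_predfit)?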
clf_gini.fit(X_test,y_test)
y_predfit = tree.DecisionTreeClassifier(X_test)
Source code
# Run this program on your local python
# interpreter, provided you have installed
# the required libraries.
# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# sklearn.externals.six is gone from recent scikit-learn; the standard library provides StringIO
from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
from sklearn import tree
import collections
import pydotplus
# Function importing Dataset
column_count = 0
def importdata():
    balance_data = pd.read_csv('data1extended.txt', sep=',')
    row_count, column_count = balance_data.shape
    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)
    print("Number of columns ", column_count)
    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    # Encode the label column 'gold' (T/N) as integer category codes
    balance_data['gold'] = balance_data['gold'].astype('category').cat.codes
    return balance_data, column_count
def columns(balance_data):
    row_count, column_count = balance_data.shape
    return column_count
# Function to split the dataset
def splitdataset(balance_data, column_count):
    # Separating the target variable
    X = balance_data.values[:, 1:column_count]
    Y = balance_data.values[:, 0]
    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)
    return X, Y, X_train, X_test, y_train, y_test
# Function to perform training with giniIndex.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
        random_state=100, max_depth=3, min_samples_leaf=5)
    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini
# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on test with giniIndex
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred
def main():
    # Building Phase
    data, column_count = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data, column_count)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    # tried to generate the fit value here and failed
    clf_gini.fit(X_test, y_test)
    y_predfit = tree.DecisionTreeClassifier(X_test)
    print('FIT: ', y_predfit)
if __name__ == "__main__":
    main()
Solution
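No answer was recorded for this question, so what follows is only a possible approach, not a confirmed solution. Two things stand out in the attempt: `clf_gini.fit(X_test, y_test)` retrains the tree on the test set and discards the fit on the training data, and `tree.DecisionTreeClassifier(X_test)` constructs a new estimator object instead of calling a prediction method on the fitted one. If "fitted value" means a per-sample score to compare against a threshold, one candidate is the class probability returned by `predict_proba`. The sketch below uses made-up toy rows and a hypothetical `threshold` of 0.5; both are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data in the spirit of the question: binary features, labels "T"/"N".
# These rows are invented for illustration only.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
])
y = np.array(["T", "T", "T", "N", "N", "N"])

clf = DecisionTreeClassifier(criterion="gini", random_state=100, max_depth=3)
clf.fit(X, y)

# predict_proba returns one column per class, ordered as in clf.classes_,
# so look up which column belongs to "T" instead of assuming its position.
t_index = list(clf.classes_).index("T")
y_predfit = clf.predict_proba(X)[:, t_index]

threshold = 0.5  # hypothetical cut-off, to be tuned for the real data
prediction = np.where(y_predfit > threshold, "T", "N")

print("FIT: ", y_predfit)
print("Prediction: ", prediction)
```

In the question's code the equivalent call would probably be `clf_gini.predict_proba(X_test)` on the classifier returned by `train_using_gini`, without refitting on the test set. Because the `gold` column is converted to category codes there, the classes would be integers (most likely N as 0 and T as 1, since categories are sorted), so the column lookup should search for that code instead of the string "T".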