python - Why is the accuracy of the GridSearchCV approach lower than the standard approach?
Problem description
I modeled my data using train_test_split (random_state = 0) and a decision tree without any parameter tuning, and I ran it about 50 times to get the best accuracy.
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
# Read the first sheet of the workbook (pd.ExcelFile has no data_only argument, and the object was unused)
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)
train, test = train_test_split(data, test_size = 0.15)
print("Training size: {}; Test size: {}".format(len(train), len(test)))
c = DecisionTreeClassifier()
features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]
x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]
dt = c.fit(x_train, y_train)
y_pred = c.predict(x_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100
print ("Accuracy using Decision Tree:", round(score, 1), "%")
In the second step, I decided to use the GridSearchCV approach to set the tree parameters.
import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
%matplotlib inline
# Read the first sheet of the workbook (pd.ExcelFile has no data_only argument, and the object was unused)
data = pd.read_excel(r"D:\Laptop.xlsx", sheet_name=0)
train, test = train_test_split(data, test_size = 0.15, random_state = 0)
print("Training size: {}; Test size: {}".format(len(train), len(test)))
features = ["Brand", "Size", "CPU", "RAM", "Resolution", "Class"]
x_train = train[features]
y_train = train["K=20"]
x_test = test[features]
y_test = test["K=20"]
# The parameter space contains a distribution (randint), so RandomizedSearchCV is used;
# GridSearchCV accepts only explicit lists of values.
param_dist = {"max_depth": list(range(10, 21)),
              "min_samples_leaf": randint(10, 60)}  # integers from 10 to 59
clf = DecisionTreeClassifier()  # named clf so it does not shadow the imported `tree` module
tree_cv = RandomizedSearchCV(clf, param_dist, cv=5)
tree_cv.fit(x_train, y_train)
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is: {}".format(tree_cv.best_score_))
y_pred = tree_cv.predict(x_test)
from sklearn.metrics import accuracy_score
score = accuracy_score(y_test, y_pred)*100
print ("Accuracy using Decision Tree:", round(score, 1), "%")
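The best accuracy I get with the first approach is much better than with the GridSearchCV approach.
Why does this happen?
Do you know the best way to get the most accurate tree?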
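Solution
It depends on the parameter ranges you give GridSearchCV.
A decision tree created without any arguments uses the defaults max_depth=None and min_samples_leaf=1, and neither value lies in the ranges you specified (max_depth from 10 to 20, min_samples_leaf from 10 to 59), so the search can never select the configuration your first approach used. Choose a better set of parameters and try GridSearchCV again.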
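As a rough illustration of the advice above, the sketch below puts the default values into the grid so the untuned tree becomes one of the candidates. It is not the asker's setup: the data comes from make_classification as a stand-in for Laptop.xlsx, and the grid values, the random_state settings, and the names param_grid, search, and default_cv_score are all assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the laptop data (assumption: the real file is not available here).
X, y = make_classification(n_samples=500, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# Include the library defaults (max_depth=None, min_samples_leaf=1) in the grid,
# so the untuned tree is one of the configurations being compared.
param_grid = {
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_leaf": [1, 2, 5, 10, 20, 40],
}

search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

# Look up the cross-validated score of the default configuration for comparison.
results = search.cv_results_
default_idx = next(
    i for i, p in enumerate(results["params"])
    if p["max_depth"] is None and p["min_samples_leaf"] == 1
)
default_cv_score = results["mean_test_score"][default_idx]

print("Best parameters:", search.best_params_)
print("Best CV score:", round(search.best_score_, 3))
print("Default-parameter CV score:", round(default_cv_score, 3))
print("Test accuracy of best model:", round(search.score(X_test, y_test), 3))
```

Because the default configuration is in the grid, the best cross-validated score cannot be lower than the default tree's cross-validated score. The accuracy on the held-out test set may still differ in either direction.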