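python - How do I use grid search with an SVM?
Problem description
I find machine learning interesting, and I am working through the scikit-learn documentation for fun. Below I have done some data cleaning; the problem is that I want to use grid search to find the best values for the parameters.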
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
cats = ['sci.space','rec.autos','rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'), categories = cats)
newsgroups_test = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'), categories = cats)
vectorizer = TfidfVectorizer( stop_words = "english")
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
clf = SVC(C=0.4,gamma=1,kernel='linear')
clf.fit(vectors, newsgroups_train.target)
vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
print(accuracy_score(newsgroups_test.target, pred))
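The accuracy is: 0.849
I have heard that grid search is used to find the best parameter values, but I don't understand how to run it. Could you explain in detail? Here is what I tried, but it is not correct. I would like to learn the right way to do it, along with some explanation. Thanks.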
Cs = np.array([0.001, 0.01, 0.1, 1, 10])
gammas = np.array([0.001, 0.01, 0.1, 1])
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=dict(Cs=alphas,gamma=gammas))
grid.fit(newsgroups_train.data, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
Edit, based on the answer I received:
parameters = {'C': [1, 10],
'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
It returns:
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False),
fit_params=None, iid='warn', n_jobs=None,
param_grid={'C': [1, 10], 'gamma': [0.001, 0.01, 1]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
0.8532212885154061
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
I need clarification on the following:
1) What is actually displayed in the results?
2) Does it search a range for C from 1 to 10, or only the values 1 and 10?
3) Can you suggest anything to improve the accuracy further?
4) I noticed that Tfidf made the accuracy worse, even though it cleaned the data of words that don't have any value.
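Solution
You want to pass a dictionary of parameters, where the keys are the parameter names as defined in the model's documentation (1). The values should be lists of the values you want to try.
The grid search will then fit the model on every possible combination of those parameters. There are some good examples in the documentation (2).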
(1) https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
(2) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
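To make "every possible combination" concrete, here is a minimal sketch that lists the candidates for a grid of this shape. It uses scikit-learn's ParameterGrid helper, which is not mentioned in the question, and the particular C and gamma values are only illustrative. Note that only the listed values are tried, so 'C': [1, 10] means C=1 and C=10, not a range between them.

```python
from sklearn.model_selection import ParameterGrid

# Keys must match SVC's constructor argument names ('C', 'gamma'),
# not made-up names such as 'Cs'.
parameters = {'C': [1, 10],
              'gamma': [0.001, 0.01, 1]}

# ParameterGrid enumerates the same combinations GridSearchCV would evaluate.
combinations = list(ParameterGrid(parameters))
for params in combinations:
    print(params)

# 2 values of C times 3 values of gamma gives 6 candidate settings.
print(len(combinations))
```

Each of these six settings is cross-validated, and the one with the best mean score is reported as the best estimator.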
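For your script, you also need to make sure you give the grid search the right training data: in this case the TF-IDF matrix "vectors", not the raw text in "newsgroups_train.data".
See below: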
parameters = {'C': [1, 10],
'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
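As an alternative sketch (not part of the answer above), the vectorizer and the SVC can be wrapped in a Pipeline so that the grid search accepts raw text directly and re-fits the TF-IDF step inside each cross-validation fold. The tiny corpus, the step names "tfidf" and "svc", and cv=2 are assumptions made only so the example runs offline; with the real data you would pass newsgroups_train.data and newsgroups_train.target instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Tiny made-up corpus (hypothetical data, only so the sketch runs offline).
texts = [
    "the rocket reached orbit", "nasa launched a new satellite",
    "the shuttle docked with the station", "astronauts walked on the moon",
    "the car engine needs new oil", "my sedan has a flat tire",
    "the truck brakes are worn", "he bought a used convertible",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Step names ("tfidf", "svc") are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svc", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.001, 0.01, 1]}

# cv=2 only because the toy corpus is so small.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=2)
grid.fit(texts, labels)

# Best combination and its mean cross-validated accuracy.
print(grid.best_params_)
print(grid.best_score_)

# Mean cross-validated score for every combination that was tried.
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```

Here best_score_ is the mean cross-validated score of the winning combination on the training data, so it is not directly comparable to accuracy measured on the held-out test set.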
If this works, please accept the answer. Good luck!