首页 > 解决方案 > 如何对 svm 使用网格搜索?

问题描述

我认为机器学习很有趣,我正在研究 scikit learn 文档以获得乐趣。下面我做了一些数据清理,问题是我想使用网格搜索来找到参数的最佳值。

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


cats = ['sci.space','rec.autos','rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train',remove=('headers', 'footers', 'quotes'), categories = cats)
newsgroups_test = fetch_20newsgroups(subset='test',remove=('headers', 'footers', 'quotes'), categories = cats)

vectorizer = TfidfVectorizer( stop_words = "english")


vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)

clf =  SVC(C=0.4,gamma=1,kernel='linear')

clf.fit(vectors, newsgroups_train.target)
vectors_test = vectorizer.transform(newsgroups_test.data)
pred = clf.predict(vectors_test)
print(accuracy_score(newsgroups_test.target, pred))

准确度为:0.849

我听说过网格搜索是为了找到参数的最佳值,但我不明白如何执行它。你能详细说明一下吗?这是我尝试过的,但不正确。我想学习正确的方法以及一些解释。谢谢

Cs = np.array([0.001, 0.01, 0.1, 1, 10])
gammas = np.array([0.001, 0.01, 0.1, 1])
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=dict(Cs=alphas,gamma=gammas))
grid.fit(newsgroups_train.data, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

根据收到的答案进行编辑:

parameters = {'C': [1, 10], 
          'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)

它返回:

GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'C': [1, 10], 'gamma': [0.001, 0.01, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.8532212885154061
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

我需要澄清这些:

1)What actually is displayed on the results?
2)Does it also take ranges for C as 1 to 10 or either 1 or 10? 
3)Can you suggest anything    to improve accuracy further?  
4)I noticed that the Tfidf made the accuracy worse even though it 
              cleaned the data from words that dont have any value

标签: pythonmachine-learningscikit-learnsvm

解决方案


您想要传递参数字典,其中键是模型文档 (1) 定义的参数名称。这些值应该是您想尝试的值的列表。

然后网格搜索将调用这些参数的所有可能组合。文档 (2) 中有一些很好的示例。

  1. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
  2. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

对于您的脚本,您还需要确保为网格搜索提供正确的训练数据,在本例中,是“vectors”而不是“newsgroups_test.data”。

见下文:

parameters = {'C': [1, 10], 
          'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)

如果有效,请接受答案。祝你好运!


推荐阅读