How to fix ValueError: n_splits=10 error with sklearn NLP

Problem description

This is my first attempt at multi-class classification and my first time using scikit-learn. I found this code online and tried to adapt it to my data.
My data looks like this:

id                      Text                                           Tags
----------------------------------------------------------------------------
1    Tears made her vision blur again                                  blue
2    She looked away, outside, at the blur of snow as he continued.    blue
3    Mr. Green, you are wanted on the phone                            green
4    I prefer oranges to apples                                        orange
5    Tom drank his orange juice                                        black

Here is my code:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

df = pd.read_csv('./dataSet03.csv')
col = ['Text', 'Tags']
data = df[col]
data.columns =['Text', 'Tags']
df['id'] = df['Tags'].factorize()[0]
product_id_data = df[['Tags', 'id']].drop_duplicates().sort_values('id')
product_to_id = dict(product_id_data.values)
id_to_product = dict(product_id_data[['id', 'Tags']].values)
tfidf = TfidfVectorizer(sublinear_tf=True, 
                        min_df=5, 
                        norm='l2', 
                        encoding='latin-1', 
                        ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Text).toarray()
labels = df.id
X_train, X_test, y_train, y_test = train_test_split(df.Text, df.Tags, random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),#Linear Support Vector Classification.
    MultinomialNB(),#Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())

When my code reaches this line, I get an error:

accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)

This is the error:

ValueError: n_splits=10 cannot be greater than the number of members in each class.

Tags: python, scikit-learn, nlp

Solution


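You are using id as the training label. In your example it looks like a unique value per row, so this makes no sense: you would end up with as many classes as observations.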
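A quick way to check this kind of problem is to count how many rows fall into each class before cross-validating. The sketch below uses only the five sample rows shown in the question; `members_per_class` is a hypothetical helper name, not part of scikit-learn or pandas.

```python
from collections import Counter

# The five sample rows from the question (id and Tags columns only).
ids = [1, 2, 3, 4, 5]
tags = ["blue", "blue", "green", "orange", "black"]


def members_per_class(labels):
    """Hypothetical helper: map each distinct label to its row count."""
    return dict(Counter(labels))


id_counts = members_per_class(ids)
tag_counts = members_per_class(tags)

# Using a unique id as the label yields one class per row.
print(len(id_counts), max(id_counts.values()))
# Using Tags groups rows into far fewer classes.
print(tag_counts)
```

On a pandas column, `df.Tags.value_counts()` gives the same per-class counts. Any class whose count is below the number of folds is a candidate cause of the error.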

You most likely want to use Tags instead. Here is an example:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

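# Toy data: 10 rows, three classes. The first two string literals have no
# comma between them, so Python concatenates them into a single entry,
# which keeps 'Text' at 10 items to match 'id' and 'Tags'.
df = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],
                   'Text':['Tears made her vision'
                           'blur again, She looked', 
                           'away, outside',
                           'at the blur of snow',
                           'as he continued.',
                           'Mr. Green, you are',
                           'wanted on the phone',
                           'I prefer oranges',
                           'to apples',
                           'Tom drank his',
                           'orange juice'],
                   'Tags':['blue','blue','green','orange','green','orange','blue','green','orange','blue']
                  })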

Running the code with CV=3:

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df.Text).toarray()
labels = df.Tags

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),#Linear Support Vector Classification.
    MultinomialNB(),#Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 3
#cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())

In this example there is not enough data to run CV=10, but as long as each class has at least 10 members, CV=10 will run. The code above gives this output:

model_name
LinearSVC                 0.222222
LogisticRegression        0.222222
MultinomialNB             0.388889
RandomForestClassifier    0.388889
Name: accuracy, dtype: float64
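The point about class sizes can be reproduced directly. When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses `StratifiedKFold`, which is where this error is raised. The sketch below reuses the ten `Tags` values from the example above with placeholder features, and derives a fold count from the smallest class; `safe_n_splits` is a hypothetical helper, and capping the folds at the smallest class size is one conservative choice, not a scikit-learn rule.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Same ten labels as the toy DataFrame above: blue x4, green x3, orange x3.
labels = ['blue', 'blue', 'green', 'orange', 'green',
          'orange', 'blue', 'green', 'orange', 'blue']
X = np.zeros((len(labels), 1))  # placeholder features; only labels matter here


def safe_n_splits(y, requested):
    """Hypothetical helper: cap the fold count at the smallest class size."""
    smallest = min(Counter(y).values())
    if smallest < 2:
        raise ValueError("every class needs at least 2 samples for stratified CV")
    return min(requested, smallest)


# Asking for 10 folds fails because no class has 10 members.
try:
    list(StratifiedKFold(n_splits=10).split(X, labels))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Capping at the smallest class size gives a fold count that works.
n_splits = safe_n_splits(labels, requested=10)
folds = list(StratifiedKFold(n_splits=n_splits).split(X, labels))
print(n_splits, len(folds))
```

The resulting value can be passed as `cv=n_splits` to `cross_val_score` in place of the hard-coded `CV`.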
