python - How to fix ValueError: n_splits=10 error in sklearn (NLP)
Problem description
This is my first attempt at multi-class classification and my first time using scikit-learn. I found this code online and tried to use it on my data.
My data looks like this:
id  Text                                                             Tags
----------------------------------------------------------------------------
1   Tears made her vision blur again                                 blue
2   She looked away, outside, at the blur of snow as he continued.   blue
3   Mr. Green, you are wanted on the phone                           green
4   I prefer oranges to apples                                       orange
5   Tom drank his orange juice                                       black
Here is my code:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
df = pd.read_csv('./dataSet03.csv')
col = ['Text', 'Tags']
data = df[col]
data.columns =['Text', 'Tags']
df['id'] = df['Tags'].factorize()[0]
product_id_data = df[['Tags', 'id']].drop_duplicates().sort_values('id')
product_to_id = dict(product_id_data.values)
id_to_product = dict(product_id_data[['id', 'Tags']].values)
tfidf = TfidfVectorizer(sublinear_tf=True,
                        min_df=5,
                        norm='l2',
                        encoding='latin-1',
                        ngram_range=(1, 2),
                        stop_words='english')
features = tfidf.fit_transform(df.Text).toarray()
labels = df.id
X_train, X_test, y_train, y_test = train_test_split(df.Text, df.Tags, random_state=0)
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),  # Linear Support Vector Classification.
    MultinomialNB(),  # Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 10
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())
I get this error when my code reaches this line:
accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
This is the error:
ValueError: n_splits=10 cannot be greater than the number of members in each class.
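Solution
You are using id as the training label. In your example it looks like a unique entry per row, so that makes no sense at all: you would end up with as many classes as observations.
You most likely want to use Tags instead. Here is an example: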
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   # The first two literals have no comma between them, so Python
                   # joins them into one string; that leaves 10 texts for 10 ids.
                   'Text': ['Tears made her vision'
                            'blur again, She looked',
                            'away, outside',
                            'at the blur of snow',
                            'as he continued.',
                            'Mr. Green, you are',
                            'wanted on the phone',
                            'I prefer oranges',
                            'to apples',
                            'Tom drank his',
                            'orange juice'],
                   'Tags': ['blue', 'blue', 'green', 'orange', 'green',
                            'orange', 'blue', 'green', 'orange', 'blue']
                   })
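Before picking a fold count, it can help to check how many rows each tag has. The snippet below is a minimal sketch: `tags` is a hypothetical stand-in for the Tags column of the example DataFrame, and `smallest_class` is an illustrative name. As far as I know, scikit-learn raises the error only when n_splits exceeds the size of every class and merely warns when it exceeds the smallest one, so staying at or below the smallest class count is the conservative choice.

```python
import pandas as pd

# Hypothetical stand-in for the Tags column of the example DataFrame.
tags = pd.Series(['blue', 'blue', 'green', 'orange', 'green',
                  'orange', 'blue', 'green', 'orange', 'blue'])

# Number of rows per class.
class_counts = tags.value_counts()

# Keeping n_splits at or below this value lets every fold
# contain at least one row of every class.
smallest_class = int(class_counts.min())

print(class_counts)
print("smallest class size:", smallest_class)
```

With these tags the smallest class has 3 rows, which is why CV=3 works and CV=10 does not.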
Run the code with CV=3:
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(df.Text).toarray()
labels = df.Tags
models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),  # Linear Support Vector Classification.
    MultinomialNB(),  # Naive Bayes classifier for multinomial models
    LogisticRegression(random_state=0),
]
CV = 3
#cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
print(cv_df.groupby('model_name').accuracy.mean())
In this example there is not enough data to run CV=10, but as long as every class has at least 10 members, CV=10 will run. The code above gives this output:
model_name
LinearSVC 0.222222
LogisticRegression 0.222222
MultinomialNB 0.388889
RandomForestClassifier 0.388889
Name: accuracy, dtype: float64
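The requirement that each class has enough members can also be turned into a small guard that caps the fold count automatically. This is only a sketch under assumptions: the toy DataFrame mirrors the example data rather than the asker's CSV, `requested_splits` and `n_splits` are illustrative names, and passing an explicit `StratifiedKFold` is one possible design, not the only one.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy data in the spirit of the example (assumed, not the asker's file).
df = pd.DataFrame({
    'Text': ['Tears made her vision blur again', 'She looked away, outside',
             'Mr. Green, you are wanted', 'I prefer oranges to apples',
             'the green phone rang', 'Tom drank his orange juice',
             'at the blur of snow', 'green leaves fell',
             'an orange on the table', 'blue sky as he continued'],
    'Tags': ['blue', 'blue', 'green', 'orange', 'green',
             'orange', 'blue', 'green', 'orange', 'blue'],
})

features = TfidfVectorizer().fit_transform(df.Text)
labels = df.Tags

# Cap the requested fold count at the size of the smallest class.
requested_splits = 10
n_splits = min(requested_splits, int(labels.value_counts().min()))

# An explicit splitter makes the stratification and the fold count visible.
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
accuracies = cross_val_score(MultinomialNB(), features, labels,
                             scoring='accuracy', cv=cv)
print("folds used:", n_splits)
print("mean accuracy:", accuracies.mean())
```

Note that `StratifiedKFold` needs at least 2 splits, so this guard still fails if some class has a single row; such classes would have to be merged, dropped, or given more data first.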