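python - Why are all my classification accuracy scores the same?
Question
I am running several machine learning models to find the one with the highest accuracy score, but every accuracy score comes out exactly the same. I have run NLP on social media text, and I am training my models to label sentiment based on the sentiment determined with NLTK.
I am using the same training and test sets for every model, but I have used this approach before and got different scores from different models. Why are mine all the same? Am I overfitting?
Here is my code for splitting and training: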
submissions_sentiment = submissions_df[["Clean_Body", "Clean_Title", "sentiment_label"]]
dataset = submissions_sentiment
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1].values
X_arr = []
for index, row in X.iterrows():
    X_arr.append(row.values)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_arr, y, test_size = 0.2, random_state = 0)
def identity_tokenizer(text):
    return text
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(tokenizer=identity_tokenizer, lowercase=False)
# fit AND transform the model (only for training data)
X_train_vectors = vectorizer.fit_transform(X_train)
# transform the test data
X_test_vectors = vectorizer.transform(X_test)
# Linear SVM
from sklearn import svm
clf_svm = svm.SVC(kernel="linear")
clf_svm.fit(X_train_vectors, y_train)
clf_svm_pred = clf_svm.predict(X_test_vectors)
# Evaluate Model Accuracy
from sklearn.metrics import accuracy_score
accuracy_score(y_test, clf_svm_pred)
# Output is .86
# Decision Tree (the variable names keep the original clf_gnb prefix)
from sklearn.tree import DecisionTreeClassifier
clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(X_train_vectors, y_train)
clf_gnb_pred = clf_gnb.predict(X_test_vectors)
# Evaluate Model Accuracy
accuracy_score(y_test, clf_gnb_pred)
# Output is .86
Here is an example from X_train:
# Review data output
print(X_train_vectors.toarray())
print(X_train[0])
print(X_train_vectors[0])
[[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
['I started really investing this year and looking for long term holdings After about 5 months or so I have decided to start putting money into ETFs for the time being while I research and learn about companies more For ETFs Im thinking about are the followingVOOQQQIm looking for another ETF that is not apart of Tech to kind of help diversify my holdings I was wondering if XLC would be a good third ETF My plan right now is each month put X amount into a single ETF then the next month put it into the next ETF etc and essentially continously put money into all three ETFs Im in my late 20s and my goal is to hold long term 10 15 years or longer If anyone has suggestions on other ETFs I would greatly appreciate it as Im trying to find the right ETFs to get into and hopefully grow over timeThank you in advance'
'What 3 ETFs are good to diversify with and buy into']
(0, 517) 1
(0, 1007) 1
The corresponding y_train value is 1 (positive).
Here are y_test and the predictions from the SVM:
print(y_test)
[ 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 -1
1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 1 1 1 1 1 1
1 -1 1 1 1 1 1 -1 1 1 1 1 1 -1 -1 -1 1 -1 1 1 1 -1 1 -1
1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 -1 1 1 1 1 1 1 -1
1 -1 1 1 1 1 1 1 1 1 1 1 1 -1 1 1 1 1 1 -1 1 1 1 1
1 1]
print(clf_svm_pred)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
And so on. The decision tree gives the same output.
Am I doing something wrong?
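Answer
I am not sure what is causing the problem, but since both the SVM model and the DecisionTreeClassifier always output 1, I suggest trying a more complex model such as RandomForestClassifier and seeing what results you get.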
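As a rough sketch of that suggestion, the snippet below fits a RandomForestClassifier and then looks at how the predicted labels are distributed, rather than at accuracy alone. The texts and labels are hypothetical toy data, and the default word-level CountVectorizer is an assumption that differs from the identity tokenizer used in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix

# Hypothetical toy data; 1 = positive sentiment, -1 = negative sentiment.
train_texts = [
    "great fund with solid returns",
    "really happy with this investment",
    "good long term holding",
    "terrible loss on this stock",
    "bad earnings and awful outlook",
    "happy with the solid growth",
]
train_labels = [1, 1, 1, -1, -1, 1]
test_texts = ["solid and good returns", "awful loss"]
test_labels = [1, -1]

# Assumption: default word-level tokenization, not the question's identity tokenizer.
vectorizer = CountVectorizer()
X_train_vectors = vectorizer.fit_transform(train_texts)
X_test_vectors = vectorizer.transform(test_texts)

clf_rf = RandomForestClassifier(n_estimators=100, random_state=0)
clf_rf.fit(X_train_vectors, train_labels)
clf_rf_pred = clf_rf.predict(X_test_vectors)

# How often is each label predicted? A single key means the model is constant.
labels, counts = np.unique(clf_rf_pred, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))

# Rows are true labels, columns are predicted labels, both in the order [-1, 1].
cm = confusion_matrix(test_labels, clf_rf_pred, labels=[-1, 1])
print(cm)
```

If the printed dictionary has only one key on your real data as well, the more complex model is also predicting a single class, which would point at the features rather than at the choice of model.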
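I have had a similar experience, where the model always gave the same performance metrics no matter how I tuned the training hyperparameters. This can have two possible causes:
- The data is not suitable for the model, for example every value in a vector is zero: [0, 0, 0, 0, 0, 0, 0]
- The model is too simple: it can only model linear relationships and cannot learn a mapping function that is too complex.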
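One way to tell whether either of these applies is to compare the score against a majority-class baseline. The sketch below uses DummyClassifier for that; the test labels are built with the same balance as the y_test printed in the question (126 positive and 20 negative out of 146, by my count of the printed array), while the training labels and the all-zero feature matrices are hypothetical placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Same class balance as the y_test printed in the question (counted by hand).
y_test = np.array([1] * 126 + [-1] * 20)
# Hypothetical training labels; only the majority class matters here.
y_train = np.array([1] * 500 + [-1] * 80)

# DummyClassifier ignores the features, so placeholder matrices are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predicts the most frequent label seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(round(baseline_accuracy, 2))  # 0.86
```

A model whose accuracy equals this baseline has not necessarily learned anything from the features. Here the baseline matches the .86 reported for both models, which is consistent with them predicting 1 for every sample.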
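Since your SVM is built with a linear kernel, could you try a more complex model and see what results it produces? Could you also check whether your X_train_vectors matrix is all zeros?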
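A minimal sketch of that check follows. It counts the rows of a sparse matrix that contain no non-zero entries, and it is worth running on the test vectors as well as the training vectors. The helper name zero_row_fraction and the toy rows are hypothetical; the rows are only shaped like the question's X_arr (a body and a title per sample), and the vectorizer settings mirror the question's, with token_pattern=None added to silence the unused-pattern warning.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity_tokenizer(text):
    # Same tokenizer as in the question: returns its input unchanged.
    return text


def zero_row_fraction(matrix):
    """Fraction of rows in a sparse matrix that have no non-zero entries."""
    nonzeros_per_row = matrix.getnnz(axis=1)
    return float((nonzeros_per_row == 0).mean())


# Hypothetical toy rows shaped like the question's X_arr: [body, title].
train_rows = [
    ["I like index funds", "ETF advice"],
    ["This stock is terrible", "Avoid this one"],
]
test_rows = [
    ["I like bonds too", "Bond advice"],
]

vectorizer = CountVectorizer(
    tokenizer=identity_tokenizer, lowercase=False, token_pattern=None
)
train_vectors = vectorizer.fit_transform(train_rows)
test_vectors = vectorizer.transform(test_rows)

# With these settings each whole string becomes one vocabulary entry.
print(sorted(vectorizer.vocabulary_))
print(zero_row_fraction(train_vectors))  # 0.0
print(zero_row_fraction(test_vectors))   # 1.0
```

In this toy setup every training row has exactly two non-zero entries (one per string), much like the two entries shown for X_train_vectors[0] in the question, while the unseen test row is entirely zero. Whether the same thing happens with your data is an assumption to verify; if it does, splitting each text into word tokens before vectorizing is one thing to try.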