Why are all my classification accuracy scores the same?

Problem description

I am running several machine learning models to find the one with the highest accuracy score, but all of the accuracy scores come out exactly the same. I performed NLP on social-media text, and I am training my models to label the sentiment based on the sentiment determined with NLTK.

I am using the same training and test sets for every model, but I have used this approach before and got different scores for different models. Why are all of mine the same? Am I overfitting?

Here is the code I use to split and train:

submissions_sentiment = submissions_df[["Clean_Body", "Clean_Title", "sentiment_label"]]
dataset = submissions_sentiment

X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1].values

X_arr = []
for index, row in X.iterrows():
    X_arr.append(row.values)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_arr, y, test_size = 0.2, random_state = 0)

def identity_tokenizer(text):
    return text

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(tokenizer=identity_tokenizer, lowercase=False)

# fit AND transform the model (only for training data)
X_train_vectors = vectorizer.fit_transform(X_train)

# transform the test data
X_test_vectors = vectorizer.transform(X_test)

# Linear SVM

from sklearn import svm

clf_svm = svm.SVC(kernel="linear")

clf_svm.fit(X_train_vectors, y_train)

clf_svm_pred = clf_svm.predict(X_test_vectors)

# Evaluate Model Accuracy
from sklearn.metrics import accuracy_score

accuracy_score(y_test, clf_svm_pred) 
# Output is .86

# Naive Bayes (note: the model actually instantiated below is a DecisionTreeClassifier)

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

clf_gnb = DecisionTreeClassifier()
clf_gnb.fit(X_train_vectors, y_train)

clf_gnb_pred = clf_gnb.predict(X_test_vectors)

# Evaluate Model Accuracy
accuracy_score(y_test, clf_gnb_pred)
# Output is .86

Here is an example of X_train:

# Review data output
print(X_train_vectors.toarray())
print(X_train[0])
print(X_train_vectors[0])
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
['I started really investing this year and looking for long term holdings After about 5 months or so I have decided to start putting money into ETFs for the time being while I research and learn about companies more For ETFs Im thinking about are the followingVOOQQQIm looking for another ETF that is not apart of Tech to kind of help diversify my holdings I was wondering if XLC would be a good third ETF My plan right now is each month put X amount into a single ETF then the next month put it into the next ETF etc and essentially continously put money into all three ETFs Im in my late 20s and my goal is to hold long term 10  15 years or longer If anyone has suggestions on other ETFs I would greatly appreciate it as Im trying to find the right ETFs to get into and hopefully grow over timeThank you in advance'
 'What 3 ETFs are good to diversify with and buy into']
  (0, 517)  1
  (0, 1007) 1

The corresponding y_train value is 1 (positive).

Here are y_test and the predictions from the linear SVM:

print(y_test)
[ 1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1 -1
  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1
  1 -1  1  1  1  1  1 -1  1  1  1  1  1 -1 -1 -1  1 -1  1  1  1 -1  1 -1
  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1  1  1  1  1  1 -1
  1 -1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1
  1  1]
print(clf_svm_pred)
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]

And so on. The decision tree produces the same output.

Am I doing something wrong?

Tags: python, machine-learning, scikit-learn

Solution


I'm not sure what is causing the problem, but since both the SVM model and the DecisionTreeClassifier always output 1, I suggest trying a more complex model such as RandomForestClassifier and seeing how the results look.
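
As a minimal sketch (assuming the X_train_vectors, y_train, X_test_vectors and y_test objects from the question are already in scope), trying a RandomForestClassifier on the same count vectors might look like this:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Fit a random forest on the same count vectors used for the SVM
clf_rf = RandomForestClassifier(n_estimators=100, random_state=0)
clf_rf.fit(X_train_vectors, y_train)

# Predict on the held-out vectors and compare against y_test
clf_rf_pred = clf_rf.predict(X_test_vectors)
print(accuracy_score(y_test, clf_rf_pred))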

I have had a similar experience before, where no matter how I tuned the training hyperparameters, the model always produced the same performance metrics. This can have two probable causes:

  1. The data is not suitable for the model, e.g. every value in a vector is zero: [0, 0, 0, 0, 0, 0, 0].
  2. The model is too simple and can only model linear relationships, so it cannot learn an overly complex mapping function.

Since your SVM is built with a linear kernel, could you try a more complex model and see what it produces? And could you check whether your X_train_vectors matrix is all zeros?
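
As a quick sanity check (a sketch, assuming X_train_vectors is the scipy sparse matrix returned by CountVectorizer.fit_transform above), you can inspect whether the vectors are effectively empty:

# Shape is (number of documents, vocabulary size)
print(X_train_vectors.shape)

# Number of non-zero entries in the whole matrix; 0 would mean every vector is all zeros
print(X_train_vectors.nnz)

# Total token count per document; a row that sums to 0 was vectorized to all zeros
print(X_train_vectors.sum(axis=1))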

