scikit-learn - Why is this TF-IDF sentiment analysis classifier performing so well?
Question
The last confusion matrix is for the test set. Is this a case of overfitting with logistic regression? Even without much text pre-processing (emoticons and punctuation are left in), the accuracy is still very good. Could anyone give some help/advice?
Answer
You are fitting the TfidfVectorizer on the whole dataset before train_test_split, which can inflate performance through data leakage. Because the TfidfVectorizer learns its vocabulary from the whole dataset, it is:
- including words in the vocabulary that are not present in the training set and appear only in the test set (out-of-vocabulary words)
- adjusting the tf-idf scores (in particular the inverse document frequencies) based on the test documents as well
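The vocabulary leak is easy to demonstrate on a toy corpus (these example phrases are made up stand-ins for the airline tweets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["great flight", "terrible delay"]
test = ["awful service"]  # "awful" never appears in train

# Leaky: vocabulary learned from train + test together
leaky = TfidfVectorizer().fit(train + test)
# Correct: vocabulary learned from the training split only
clean = TfidfVectorizer().fit(train)

print("awful" in leaky.vocabulary_)  # the test-only word leaked in
print("awful" in clean.vocabulary_)
```

With the leaky fit, the test-only word "awful" ends up in the vocabulary (and shifts the IDF statistics); with the correct fit, it does not.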
Try the following instead, fitting the vectorizer on the training split only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

tweets_train, tweets_test, y_train, y_test = train_test_split(
    reviews['text'].tolist(), reviews['airline_sentiment'],
    test_size=0.3, random_state=42)

v = TfidfVectorizer()
X_train = v.fit_transform(tweets_train)  # learn vocabulary/IDF from train only
X_test = v.transform(tweets_test)        # reuse them on the test set
```
And then check the performance.
Note: this may not be the only reason for the high performance. It may also simply be that the dataset is one on which plain tf-idf features work well.
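One way to make this kind of leakage impossible is to wrap the vectorizer and classifier in a scikit-learn Pipeline, so the vectorizer is refit on the training portion of every split automatically. A minimal sketch with made-up toy data (the texts and labels below are hypothetical stand-ins for the airline tweets):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for reviews['text'] / reviews['airline_sentiment']
texts = ["love this airline", "great crew", "smooth flight",
         "terrible delay", "lost my luggage", "awful service"]
labels = ["positive", "positive", "positive",
          "negative", "negative", "negative"]

# Inside cross_val_score, the TfidfVectorizer is fit only on each training fold,
# so the test fold never influences the vocabulary or the IDF weights.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, texts, labels, cv=3)
print(scores)
```

This also makes cross-validated accuracy estimates trustworthy, since each held-out fold is scored with a vectorizer that has never seen it.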