Why is this TF-IDF sentiment analysis classifier performing so well?


Jupyter Notebook

The last confusion matrix is for the test set. Is this a case of overfitting with logistic regression? Even with very little text pre-processing (emoticons and punctuation are left in), the accuracy is still very good. Could anyone give some help or advice?

Tags: scikit-learn, nlp, logistic-regression, tf-idf


You are fitting the TfidfVectorizer on the whole dataset before calling train_test_split, which may be inflating the performance through "data leakage". Because the TfidfVectorizer learns its vocabulary and IDF weights from all of the data, it is:

  • including words in the vocabulary that never occur in the training set and only occur in the test set (words that would otherwise be out-of-vocabulary)
  • adjusting the tf-idf scores using document frequencies from the test set as well
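
Both effects can be seen on a toy corpus. This is only a minimal sketch: the documents below are made up for illustration and are not from the question's dataset, and the variable names (`leaky`, `clean`) are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, standing in for a train/test split of the real data.
train_docs = ["good flight", "bad delay", "good crew"]
test_docs = ["terrible turbulence", "good landing"]

# "Leaky" vectorizer: fitted on train and test together.
leaky = TfidfVectorizer().fit(train_docs + test_docs)
# "Clean" vectorizer: fitted on the training documents only.
clean = TfidfVectorizer().fit(train_docs)

# Effect 1: terms that exist only in the test set enter the vocabulary.
test_only_terms = set(leaky.vocabulary_) - set(clean.vocabulary_)
print(sorted(test_only_terms))

# Effect 2: the IDF weight of a shared term changes, because the
# document frequencies now include the test documents.
good_idf_leaky = leaky.idf_[leaky.vocabulary_["good"]]
good_idf_clean = clean.idf_[clean.vocabulary_["good"]]
print(good_idf_leaky, good_idf_clean)
```

How much this matters on a real dataset depends on its size and vocabulary; on a large corpus the shift in IDF weights may be small.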

Try the following:

tweets_train, tweets_test, y_train, y_test = train_test_split(reviews['text'].tolist(), 

X_train = v.fit_transform(tweets_train)
X_test = v.transform(tweets_test)

And then check the performance.
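
One way to make the split-then-fit order hard to get wrong is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so the vectorizer is only ever fitted on whatever is passed to `fit`. The following is a sketch under assumptions: the `texts` and `labels` lists are invented toy data, and the split parameters (`test_size`, `random_state`, `stratify`) are illustrative choices, not the ones used in the question's notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented toy data; replace with the real texts and sentiment labels.
texts = [
    "great flight and friendly crew", "loved the service", "smooth landing",
    "very comfortable seats", "on time and pleasant", "best airline ever",
    "flight delayed for hours", "lost my luggage", "rude staff at the gate",
    "terrible food and no legroom", "cancelled without notice", "worst trip ever",
]
labels = ["positive"] * 6 + ["negative"] * 6

# Split FIRST, so the test texts stay unseen by every fitted component.
tweets_train, tweets_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The pipeline fits TfidfVectorizer on the training texts only and
# reuses that fitted vocabulary/IDF when transforming the test texts.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets_train, y_train)

y_pred = model.predict(tweets_test)
cm = confusion_matrix(y_test, y_pred, labels=["negative", "positive"])
print(cm)
```

The same pipeline object can also be passed to `cross_val_score`, in which case the vectorizer is refitted inside each fold and the leakage is avoided there as well.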

Note: This may not be the only reason for the performance. Or maybe the dataset is such that simple tf-idf works well for it.
