Why is this TF-IDF sentiment analysis classifier performing so well?

Question

Jupyter Notebook

The last confusion matrix is for the test set. Is this a case of overfitting with logistic regression? Even without much text pre-processing (emoticons and punctuation are left in), the accuracy is still very good. Could anyone give some help or advice?

Tags: scikit-learn, nlp, logistic-regression, tf-idf

Answer


You are fitting the TfidfVectorizer on the whole dataset before calling train_test_split, which may inflate the measured performance through "data leakage". Because the TfidfVectorizer learns its vocabulary and IDF weights from all of the data, it is:

  • including words in the vocabulary that appear only in the test set and never in the training set (out-of-vocabulary words)
  • computing the tf-idf scores with document frequencies that also count the test documents
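
Both effects can be seen on a tiny example. The following is only a sketch: the two small corpora are invented for illustration and are not the airline tweets from the question, and the vectorizers use default settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (made up for illustration): "delayed" and "again"
# occur only in the test split.
train_docs = ["great flight", "great crew", "lost bag"]
test_docs = ["flight delayed", "flight delayed again"]

# Leaky: vocabulary and IDF weights are learned from train + test together.
leaky = TfidfVectorizer().fit(train_docs + test_docs)

# Clean: vocabulary and IDF weights are learned from the training split only.
clean = TfidfVectorizer().fit(train_docs)

# Words the leaky vectorizer knows only because it saw the test split.
test_only_words = set(leaky.vocabulary_) - set(clean.vocabulary_)
print(sorted(test_only_words))

# The IDF weight of a word present in both splits shifts once the
# test documents are included in the document-frequency counts.
idf_leaky = leaky.idf_[leaky.vocabulary_["flight"]]
idf_clean = clean.idf_[clean.vocabulary_["flight"]]
print(idf_leaky, idf_clean)
```

On real data the gap may be small, so this shows the mechanism rather than how much it matters for any particular dataset.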

Try the following:

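from sklearn.model_selection import train_test_split

# Split the raw text first, so the vectorizer never sees the test tweets.
tweets_train, tweets_test, y_train, y_test = train_test_split(reviews['text'].tolist(),
                                                              reviews['airline_sentiment'],
                                                              test_size=0.3,
                                                              random_state=42)

# v is the TfidfVectorizer from the notebook: fit it on the training split only,
# then reuse the learned vocabulary and IDF weights on the test split.
X_train = v.fit_transform(tweets_train)
X_test = v.transform(tweets_test)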

Then retrain the classifier and check the performance on the test set again.

Note: Leakage may not be the only reason for the high scores. It may also be that this dataset is simply one on which plain tf-idf features work well.
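
One way to rule out this kind of leakage is to wrap the vectorizer and the classifier in a scikit-learn pipeline and cross-validate the pipeline, so that the vectorizer is refit on the training part of every fold. A minimal sketch, assuming placeholder data (the texts and labels below are invented, not the airline tweets) and default hyperparameters apart from max_iter:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data (hypothetical); substitute reviews['text'].tolist()
# and reviews['airline_sentiment'] from the notebook.
texts = ["great flight", "lost my bag", "friendly crew", "flight delayed again",
         "great service", "rude staff", "smooth landing", "cancelled flight"]
labels = ["positive", "negative", "positive", "negative",
          "positive", "negative", "positive", "negative"]

# Inside cross_val_score the whole pipeline is fit per fold, so the
# vocabulary and IDF weights come from that fold's training part only.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, texts, labels, cv=2)
print(scores.mean())
```

If the cross-validated accuracy on the real data stays close to the original figure, leakage was probably not the main driver of the result.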

