When using spaCy or BERT for text classification, do you need stop-word removal, stemming, or lemmatization?

Question

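When using spaCy, BERT, or other advanced NLP models to obtain vector embeddings of text, is it necessary to apply stop-word removal, stemming, and lemmatization?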

text="The food at the wedding was very tasty"

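1. Since spaCy and BERT were trained on huge raw datasets, is there any benefit to applying stop-word removal, stemming, and lemmatization to the text before generating embeddings with BERT/spaCy for a text classification task?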

2. When we use CountVectorizer or TfidfVectorizer to embed sentences, I can understand that stop-word removal, stemming, and lemmatization would help.

Tags: nlp, spacy, text-classification, bert-language-model

Solution


You can test whether stemming, lemmatization, and stop-word removal help. They don't always. I usually apply them when I'm going to graph the results, because stop words clutter up the output.
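One way to run such a test is to score the same pipeline twice, once with stop words kept and once with them removed. The sketch below is only an illustration: it swaps in a TF-IDF plus logistic-regression pipeline from scikit-learn in place of BERT so that it runs quickly, and the tiny dataset, the labels, and the helper name `mean_cv_accuracy` are all made up for this example. On real data you would plug in your own texts, labels, and model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration: 1 = positive, 0 = negative.
texts = [
    "the food at the wedding was very tasty",
    "I loved the music and the cake",
    "the service was friendly and quick",
    "a wonderful evening with great food",
    "the dessert was delicious",
    "everyone enjoyed the dinner",
    "the food was not tasty at all",
    "I did not like the music",
    "the service was never friendly",
    "a terrible evening with cold food",
    "the dessert was not good",
    "nobody enjoyed the dinner",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]


def mean_cv_accuracy(stop_words):
    """Mean 3-fold accuracy for a TF-IDF + logistic regression pipeline.

    stop_words=None keeps every token; stop_words="english" drops
    scikit-learn's built-in English stop-word list.
    """
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words=stop_words),
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
    return scores.mean()


with_stop_words = mean_cv_accuracy(None)
without_stop_words = mean_cv_accuracy("english")
print(f"stop words kept:    {with_stop_words:.2f}")
print(f"stop words removed: {without_stop_words:.2f}")
```

With only twelve sentences the numbers themselves mean little. The point is the shape of the experiment: hold everything else fixed, toggle one preprocessing step, and compare cross-validated scores.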

A case for not removing stop words: stop words help convey the user's intent, which matters when you use a contextual model like BERT. With such models, all stop words are kept so the model has enough context, including negation words (not, nor, never), which are commonly treated as stop words.
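The negation point can be seen with a few lines of plain Python. The sketch below uses a small hand-picked stop-word set (an assumption for illustration, not the list shipped by any particular library) and a hypothetical helper `remove_stop_words`. After filtering, two sentences with opposite meanings collapse to the same tokens.

```python
# Small illustrative stop-word set; real lists are much longer.
# This subset is an assumption made for the example.
STOP_WORDS = {"the", "was", "a", "at", "is", "not", "nor", "never"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


positive = "The food was tasty"
negative = "The food was not tasty"

print(remove_stop_words(positive))  # ['food', 'tasty']
print(remove_stop_words(negative))  # ['food', 'tasty']
```

Once "not" is gone, a downstream model has no way to tell the two sentences apart, whatever embedding it uses.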

According to https://arxiv.org/pdf/1904.07531.pdf:

"Surprisingly, the stopwords received as much attention as non-stop words, but removing them has no effect in MRR performances."

