How do I remove all non-English tokens from sklearn's TfidfVectorizer?

Problem description

TfidfVectorizer(analyzer='word', ngram_range=ngram_range, min_df=0, stop_words=lang)

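I am trying to vectorize my corpus, but it contains both English and Arabic words. I want to remove the Arabic words.

Tags: python, scikit-learn

Solution

You can use strip_accents="ascii". It normalizes the text and then drops every character that has no ASCII equivalent, so tokens written entirely in Arabic (or any other non-Latin script) disappear before tokenization: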

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document. সহজ  نعم فعلا',
    'This document is the second document. সহজ نعم فعلا',
    'And this is the third one.',
    'Is this the first document?',
]

# strip_accents="ascii" drops every character without an ASCII equivalent
vectorizer = TfidfVectorizer(strip_accents="ascii")
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

Output:

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
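
Keep in mind that strip_accents="ascii" operates on characters, not on tokens. As far as I know, scikit-learn implements it as an NFKD normalization followed by an ASCII encode that ignores anything it cannot represent. The sketch below reproduces that behaviour with the standard library only; the helper name strip_to_ascii is made up for illustration and is not part of scikit-learn.

```python
import unicodedata


def strip_to_ascii(text: str) -> str:
    """Hypothetical helper that mimics strip_accents="ascii"."""
    # Split accented characters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that cannot be encoded as ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_to_ascii("café"))     # accent removed, word kept: cafe
print(strip_to_ascii("dataنعم"))  # mixed token keeps its ASCII part: data
```

So accented Latin words survive in unaccented form, and a token that mixes Latin and Arabic characters is shortened rather than removed. Whether that is acceptable depends on the corpus.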
