Removing stopwords at the beginning of a sentence with NLTK

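Problem description

I am trying to remove all stopwords from a text input. The code below removes every stopword except the one at the beginning of each sentence.

How can I remove those words as well?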

from string import punctuation

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

stopwords_nltk_en = set(stopwords.words('english'))
exclude_punctuation = set(punctuation)

# Tokens to drop: English stopwords plus single punctuation characters
stoplist_combined = set.union(stopwords_nltk_en, exclude_punctuation)

def normalized_text(text):
    lemma = WordNetLemmatizer()
    # Lowercase, split on whitespace, and drop tokens found in the stoplist
    stopwords_punctuations_free = ' '.join([i for i in text.lower().split() if i not in stoplist_combined])
    # Lemmatize each remaining token
    normalized = ' '.join(lemma.lemmatize(word) for word in stopwords_punctuations_free.split())
    return normalized


sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]

for item in sentence:
    print (normalized_text(str(item)))

OUTPUT:
   the bird always house
   in hill bird nest

Tags: python, python-3.x, nltk

Solution


The culprit is this line of code:

print (normalized_text(str(item)))

If you print str(item) for the first element of the list sentence, you get:

['The birds are always in their house.']

Lowercasing and splitting that string then gives:

["['the", 'birds', 'are', 'always', 'in', 'their', "house.']"]

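As you can see, the first element ['the does not match the stopword the, because str() on a list returns its repr, including the bracket and quote characters.

Solution: use ''.join(item) to convert item to a str.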
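To make the difference concrete, here is a minimal sketch that compares the two conversions on the first item from the question. It uses only built-ins; the variable names tokens_from_str and tokens_from_join are illustrative and not part of the original code.

```python
item = ['The birds are always in their house.']

# str() on a list returns its repr, brackets and quotes included.
tokens_from_str = str(item).lower().split()

# ''.join() concatenates the strings in the list into plain text.
tokens_from_join = ''.join(item).lower().split()

print(tokens_from_str[0])   # ['the
print(tokens_from_join[0])  # the
```

Only the second form yields a first token that can match the stopword the.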


Edit after comments

There are still some apostrophes ' inside the text strings. To deal with them, call normalized_text as:

for item in sentence:
    print (normalized_text(item))

Then import the regular-expression module with import re and change:

text.lower().split()

to:

re.split('\'| ', ''.join(text).lower())
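Putting the pieces together, the tokenization change might look like the sketch below. To keep it runnable without downloading NLTK data, it assumes a tiny stand-in stopword set (STOPWORDS) rather than stopwords.words('english'), and it leaves out lemmatization. The helper name remove_stopwords and the filter that discards empty tokens are additions for illustration, not part of the original answer.

```python
import re
from string import punctuation

# Stand-in stopword set (an assumption for this sketch); in the real code
# this would be set(stopwords.words('english')) from nltk.corpus.
STOPWORDS = {'the', 'are', 'in', 'their'}
stoplist = STOPWORDS | set(punctuation)

def remove_stopwords(text):
    # ''.join accepts either a one-element list or a plain string.
    # Split on apostrophes or spaces, as in the suggested re.split call.
    tokens = re.split(r"'| ", ''.join(text).lower())
    # Drop empty tokens (produced by adjacent separators) and stoplist entries.
    return ' '.join(t for t in tokens if t and t not in stoplist)

sentence = [['The birds are always in their house.'], ['In the hills the birds nest.']]

for item in sentence:
    print(remove_stopwords(item))
```

Note that a trailing period stays attached to the last word (for example house.), because the stoplist only matches tokens that consist of a single punctuation character.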
