Removing stopwords with a modified stopword list

Problem description

Background:

1) I have the following code to remove stopwords using the nltk package:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in stopwords.words('english')]

2) This code works and removes stopwords such as the, as shown here:

['dog', 'bark', 'tree', 'sees', 'squirrel']

3) I changed the stopwords with the code below so that the word not is no longer treated as a stopword:

to_remove = ['not']
new_stopwords = set(stopwords.words('english')).difference(to_remove)

Problem:

4) But when I use new_stopwords in the following code:

your_string = "The dog does not bark at the tree when it sees a squirrel"
tokens = word_tokenize(your_string)
lower_tokens = [t.lower() for t in tokens]
filtered_words = [word for word in lower_tokens if word not in new_stopwords.words('english')]

5) I get the following error, because new_stopwords is a set:

AttributeError: 'set' object has no attribute 'words' 

Question:

6) How can I use the newly defined new_stopwords to get the desired output:

['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']

Tags: python-3.x, set, nltk, list-comprehension, stop-words

Solution


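You are very close, but you misread the error message. The problem is not that "new_stopwords is a set", as you put it, but that a set has no attribute words, which is true: words() is a method of the nltk stopwords corpus reader, not of the set you built from its output.

new_stopwords is already a set of words, which means you can use it directly in the list comprehension: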

filtered_words = [word for word in lower_tokens if word not in new_stopwords]
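Putting the pieces together, a minimal sketch of the whole flow might look like the following. To keep it runnable without downloading the NLTK corpora, it makes two assumptions that are not in the original code: SAMPLE_STOPWORDS is a short hand-written stand-in for stopwords.words('english'), and str.split stands in for word_tokenize. In real code you would use the NLTK calls from the question instead.

```python
# Stand-in for stopwords.words('english') (assumption); the real NLTK list is much longer.
SAMPLE_STOPWORDS = ["the", "does", "not", "at", "when", "it", "a"]

to_remove = ["not"]
# set.difference accepts any iterable and returns a new set without those items.
new_stopwords = set(SAMPLE_STOPWORDS).difference(to_remove)

your_string = "The dog does not bark at the tree when it sees a squirrel"
# str.split stands in for word_tokenize (assumption); fine for this punctuation-free sentence.
lower_tokens = [t.lower() for t in your_string.split()]

# new_stopwords is already a set, so test membership on it directly.
filtered_words = [word for word in lower_tokens if word not in new_stopwords]
print(filtered_words)
# prints ['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']
```

Membership tests on a set are also faster than on a list, so building the set once up front is a reasonable habit here.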

You can also save yourself the trouble of modifying the stopword list and simply use two conditions:

keep_list = ['not']
filtered_words = [word for word in lower_tokens if (word not in stopwords.words("english")) or (word in keep_list)]
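One thing to watch in the line above: stopwords.words("english") is evaluated again for every token, and each check then searches a list. A hedged sketch of the same two-condition idea with the lookup built once is shown below. The helper name filter_tokens and the short sample_stopwords list are hypothetical, introduced only for illustration; with NLTK available you would pass stopwords.words("english") as the second argument.

```python
def filter_tokens(tokens, stopword_list, keep=()):
    """Drop tokens found in stopword_list unless they also appear in keep."""
    stop_set = set(stopword_list)  # built once, not once per token
    keep_set = set(keep)
    return [t for t in tokens if t not in stop_set or t in keep_set]


# Stand-in for stopwords.words("english") (assumption, not the real NLTK list).
sample_stopwords = ["the", "does", "not", "at", "when", "it", "a"]
lower_tokens = ["the", "dog", "does", "not", "bark", "at", "the",
                "tree", "when", "it", "sees", "a", "squirrel"]

result = filter_tokens(lower_tokens, sample_stopwords, keep=["not"])
print(result)
# prints ['dog', 'not', 'bark', 'tree', 'sees', 'squirrel']
```

Leaving keep empty gives the plain stopword filter from step 1, so the same helper covers both cases.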
