Spam filtering: removing stop words

Problem description

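I have two lists: l1 is my main list, and l2 is a list of stop words. I want to remove the stop words in l2 from the second element of each nested list in l1. However, my code doesn't seem to be effective: it removes only one stop word and leaves the rest in l1. This is what l1 looks like: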

[['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection.....]],...]

This is what l2 looks like:

['a', ' able', ' about', ' across', ' after', ' all', ' almost', ' also', ' am', ' among', ' an', ' and', ' any',....]

This is what I tried:

for i in l1:
   i[1] = i[1].lower()
   i[1] = i[1].split()
   for j in i[1]:
      if j in l2:
         i[1].remove(j)

Tags: python

Solution

If you don't want to reinvent the wheel, you can use nltk to tokenize the text and remove the stop words:

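import nltk
from nltk.corpus import stopwords

# One-time setup: nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]
stop_words = set(stopwords.words('english'))  # build the set once, not once per token

for text in (label_text[1] for label_text in data):
    # Compare the lowercased token so that 'And' matches the stop word 'and'
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in stop_words]
    print(filtered_tokens)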

The output should be:

>>> [',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'landline', '.', '£5000', 'cash', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']

If you still want to use your own list of stop words, the following should do the trick:

import nltk

data = [['ham', 'And how you will do that, princess? :)'], ['spam', 'Urgent! Please call 09061213237 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection']]
stopwords = {'a', 'able', 'about', 'across', 'after', 'all', 'almost', 'also', 'am', 'among', 'an', 'and', 'any'}  # a set makes the membership test fast

for text in (label_text[1] for label_text in data): 
    filtered_tokens = [token for token in nltk.word_tokenize(text) if token.lower() not in stopwords]
    print(filtered_tokens)

>>> ['how', 'you', 'will', 'do', 'that', ',', 'princess', '?', ':', ')']
>>> ['Urgent', '!', 'Please', 'call', '09061213237', 'from', 'landline', '.', '£5000', 'cash', 'or', 'luxury', '4*', 'Canary', 'Islands', 'Holiday', 'await', 'collection']
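
As for why the loop in the question removes so few words, two things are likely at play. Calling remove() on a list while iterating over it makes the loop skip the element that follows each removed one, and most entries in l2 as shown carry a leading space (' able', ' about'), so they never match a token produced by split(). Below is a minimal sketch without nltk that avoids both problems. The function name remove_stopwords and the shortened sample data are only illustrative, and stripping ASCII punctuation before the comparison is an assumption about what should count as a match.

```python
import string

def remove_stopwords(rows, stopwords):
    """Return a new list of [label, tokens] pairs with stop words removed."""
    # Normalise the stop words: drop stray whitespace such as ' able'
    # and store them in a set for fast lookups.
    stop_set = {word.strip().lower() for word in stopwords}
    cleaned = []
    for label, text in rows:
        tokens = text.lower().split()
        # Build a new list instead of removing items from the list
        # that is currently being iterated over.
        kept = [t for t in tokens if t.strip(string.punctuation) not in stop_set]
        cleaned.append([label, kept])
    return cleaned

# Shortened sample data based on the question (illustrative only)
l1 = [['ham', 'And how you will do that, princess? :)'],
      ['spam', 'Urgent! Please call 09061213237 from landline.']]
l2 = ['a', ' able', ' about', ' and', ' from']

result = remove_stopwords(l1, l2)
print(result)
```

Unlike nltk.word_tokenize, split() leaves punctuation attached to the words ('that,' stays a single token), so this version suits quick experiments more than careful tokenization.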
