
Problem description

I am trying to tokenize a sentence and then remove the punctuation.

from nltk import word_tokenize
from nltk import re
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
sentence = "what's good people boy's"


tokens = word_tokenize(sentence)
tokens_nopunct = [word.lower() for word in tokens if re.search("\w",word)]
tokens_lemma = [lemmatizer.lemmatize(token) for token in tokens]

print(tokens_lemma)

This gives the output:

['what', "'s", 'good', 'people', 'boy', "'s"]

But I want it to produce: ['what', 'good', 'people', 'boy']

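I have been looking through NLTK and its documentation, which says that re.search is how you remove punctuation, but it is not working. Is there something else wrong in my code?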

Tags: python, python-3.x, nltk

Solution


This removes every element that starts with a punctuation character (not just 's):

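import string

# Set of ASCII punctuation characters for fast membership tests
punc = set(string.punctuation)
a = ['what', "'s", 'good', 'people', 'boy', "'s"]
# Keep only the tokens whose first character is not punctuation
without_punc = list(filter(lambda x: x[0] not in punc, a))
print(without_punc)  # ['what', 'good', 'people', 'boy']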
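
As for why the filter in the question does not drop "'s": re.search with \w keeps any token that contains at least one word character, and "'s" contains the letter s. Note also that the question's tokens_lemma line iterates over tokens rather than tokens_nopunct, so the filtered list is never used. The sketch below is only an illustration using the standard library on the already-tokenized list; the names kept_by_search and kept_strict are made up for this example, and the stricter test (reject a token if any character is punctuation) is one possible choice, not the only one.

```python
import re
import string

tokens = ['what', "'s", 'good', 'people', 'boy', "'s"]

# The question's test: keep a token if it contains any word character.
# "'s" contains the letter "s", so it passes and nothing is removed.
kept_by_search = [t for t in tokens if re.search(r"\w", t)]

# Stricter test (an assumption for illustration): keep a token only if
# none of its characters is an ASCII punctuation mark.
punc = set(string.punctuation)
kept_strict = [t for t in tokens if not any(ch in punc for ch in t)]

print(kept_by_search)
print(kept_strict)
```

The stricter test would also drop hyphenated tokens such as "well-known", so checking only the first character, as in the answer above, may be the better fit when such words should survive.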
