Why tokenize/preprocess words for language analysis?

Question

I am currently working on a Python tweet analyser and part of this will be to count common words. I have seen a number of tutorials on how to do this, and most tokenize the strings of text before further analysis.

Surely it would be easier to avoid this stage of preprocessing and count the words directly from the string - so why do this?

Tags: python, nltk, tweepy, analysis

Solution


Let's try it with this sentence:

text = "We like the cake you did this week, we didn't like the cakes you cooked last week"

Counting directly, without nltk tokenization:

Counter(text.split())

returns:

Counter({'We': 1,
     'cake': 1,
     'cakes': 1,
     'cooked': 1,
     'did': 1,
     "didn't": 1,
     'last': 1,
     'like': 2,
     'the': 2,
     'this': 1,
     'we': 1,
     'week': 1,
     'week,': 1,
     'you': 2})

We can see the result is not what we want: "did" and "didn't" (which is a contraction of "did not") are counted as different words, and so are "week" and "week," (the one with the trailing comma).
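
You could of course try to patch this up by hand, for example with a regex, but you quickly end up writing your own (worse) tokenizer; a minimal sketch of that dead end:

import re

# naive cleanup: lowercase and keep only runs of word characters
Counter(re.findall(r"\w+", text.lower()))

# the trailing comma on 'week,' is gone now, but "didn't" gets mangled
# into the two separate tokens 'didn' and 't'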

This problem is fixed when you tokenize with nltk (split() is really just a very naive way of tokenizing):

Counter(nltk.word_tokenize(text))

returns:

Counter({',': 1,
     'We': 1,
     'cake': 1,
     'cakes': 1,
     'cooked': 1,
     'did': 2,
     'last': 1,
     'like': 2,
     "n't": 1,
     'the': 2,
     'this': 1,
     'we': 1,
     'week': 2,
     'you': 2})

If you want 'cake' and 'cakes' to be counted as the same word, you can also lemmatize:

Counter([lemmatizer.lemmatize(w).lower() for w in nltk.word_tokenize(text)])

returns:

Counter({',': 1,
     'cake': 2,
     'cooked': 1,
     'did': 2,
     'last': 1,
     'like': 2,
     "n't": 1,
     'the': 2,
     'this': 1,
     'we': 2,
     'week': 2,
     'you': 2})
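
Finally, since the original goal is counting common words in tweets, you would typically also drop punctuation and stopwords before looking at the most frequent tokens. A rough sketch, assuming nltk's English stopwords list (another one-off nltk.download('stopwords')):

import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop = set(stopwords.words('english')) | set(string.punctuation) | {"n't"}
tokens = [lemmatizer.lemmatize(w).lower() for w in nltk.word_tokenize(text)]
Counter(w for w in tokens if w not in stop).most_common(3)

# e.g. [('like', 2), ('cake', 2), ('week', 2)]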
