python - Why tokenize/preprocess words for language analysis?
Question
I am currently working on a Python tweet analyser and part of this will be to count common words. I have seen a number of tutorials on how to do this, and most tokenize the strings of text before further analysis.
Surely it would be easier to avoid this stage of preprocessing and count the words directly from the string - so why do this?
Solution
Try it with this sentence:
text = "We like the cake you did this week, we didn't like the cakes you cooked last week"
Counting directly, without nltk tokenization:
Counter(text.split())
returns:
Counter({'We': 1,
'cake': 1,
'cakes': 1,
'cooked': 1,
'did': 1,
"didn't": 1,
'last': 1,
'like': 2,
'the': 2,
'this': 1,
'we': 1,
'week': 1,
'week,': 1,
'you': 2})
We can see that we are not happy with the result: 'did' and "didn't" (which is a contraction of 'did not') are counted as different words, and the same goes for 'week' and 'week,'.
This problem is fixed when you tokenize with nltk (split is actually a naive way of tokenizing):
import nltk  # word_tokenize needs the punkt tokenizer models: nltk.download('punkt')

Counter(nltk.word_tokenize(text))
returns:
Counter({',': 1,
'We': 1,
'cake': 1,
'cakes': 1,
'cooked': 1,
'did': 2,
'last': 1,
'like': 2,
"n't": 1,
'the': 2,
'this': 1,
'we': 1,
'week': 2,
'you': 2})
If you want 'cake' and 'cakes' to be counted as the same word, you can also lemmatize:
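(The snippet below uses a lemmatizer object that the answer never defines; a minimal setup sketch, assuming nltk's WordNetLemmatizer, which matches the output shown:)

from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # WordNet data, fetched once
lemmatizer = WordNetLemmatizer()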
Counter([lemmatizer.lemmatize(w).lower() for w in nltk.word_tokenize(text)])
returns:
Counter({',': 1,
'cake': 2,
'cooked': 1,
'did': 2,
'last': 1,
'like': 2,
"n't": 1,
'the': 2,
'this': 1,
'we': 2,
'week': 2,
'you': 2})
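For reference, here is the whole pipeline as one self-contained sketch; the nltk.download calls (punkt for the tokenizer, wordnet for the lemmatizer) only need to run once:

from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')    # models used by nltk.word_tokenize
nltk.download('wordnet')  # data used by WordNetLemmatizer

text = "We like the cake you did this week, we didn't like the cakes you cooked last week"

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize(text)  # splits "didn't" into "did" + "n't" and puts the comma in its own token
counts = Counter(lemmatizer.lemmatize(w).lower() for w in tokens)
print(counts.most_common())        # 'cake' and 'cakes' now share a single entry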