python - NLTK corpus cannot read a text file
Problem description
I have a sample Python script like this:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('/Users/anonymous/Desktop/text.txt')
)
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 5)
I cannot run it because I get the following error:
Traceback (most recent call last):
File "text2tag.py", line 10, in <module>
nltk.corpus.genesis.words('/Users/anonymous/Desktop/text.txt')
File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/collocations.py", line 178, in from_words
for window in ngrams(words, window_size, pad_right=True):
File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/util.py", line 525, in ngrams
next_item = next(sequence)
File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
tokens = self.read_block(self._stream)
File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/corpus/reader/plaintext.py", line 134, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/tokenize/regexp.py", line 133, in tokenize
return self._regexp.findall(text)
TypeError: cannot use a string pattern on a bytes-like object
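Solution
It looks like you are trying to use your own word file in place of the genesis corpus. nltk.corpus.genesis.words is meant to be called with file IDs from that bundled corpus rather than a path to an arbitrary file, so read the file yourself and pass the tokens to from_words: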
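import nltk
from nltk.collocations import BigramCollocationFinder

# Read the file as text and split it into whitespace-separated tokens.
with open('file.txt', 'r') as f:
    tokens = f.read().split()

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
# Ignore bigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)
# Print the 5 bigrams with the highest PMI score.
print(finder.nbest(bigram_measures.pmi, 5))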
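The TypeError at the bottom of the traceback is the one Python's re module raises when a str pattern is applied to bytes, which suggests the corpus reader handed undecoded bytes to its regexp tokenizer. The sketch below reproduces that error without NLTK; the pattern and the sample text are made up for illustration and are not NLTK's actual tokenizer.

```python
import re

# A str pattern, similar in spirit to what a regexp-based tokenizer compiles.
# (Illustrative only; this is not the pattern NLTK uses.)
pattern = re.compile(r"\w+")

# Bytes, as you would get from a file opened in binary mode.
raw = b"In the beginning"

try:
    pattern.findall(raw)
    error_message = None
except TypeError as exc:
    # A str pattern cannot be matched against a bytes object.
    error_message = str(exc)

# Decoding to str first (or opening the file in text mode) avoids the error.
tokens = pattern.findall(raw.decode("utf-8"))

print(error_message)
print(tokens)
```

Opening the file in text mode, as the answer's open('file.txt', 'r') does, gives str data from the start, so the mismatch does not arise.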