NLTK corpus cannot read a text file

Question

I have a sample Python script like this:

import nltk

from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
   nltk.corpus.genesis.words('/Users/anonymous/Desktop/text.txt')
)
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 5)

I can't run it, because I get the following error:

Traceback (most recent call last):
  File "text2tag.py", line 10, in <module>
    nltk.corpus.genesis.words('/Users/anonymous/Desktop/text.txt')
  File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/collocations.py", line 178, in from_words
    for window in ngrams(words, window_size, pad_right=True):
  File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/util.py", line 525, in ngrams
    next_item = next(sequence)
  File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/corpus/reader/util.py", line 296, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/corpus/reader/plaintext.py", line 134, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline()))
  File "/Users/anonymous/.virtualenvs/playground/lib/python3.6/site-packages/nltk/tokenize/regexp.py", line 133, in tokenize
    return self._regexp.findall(text)
TypeError: cannot use a string pattern on a bytes-like object

Tags: python, nlp, nltk, corpus

Answer


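It looks like you are trying to use your own text file in place of the files that nltk.corpus.genesis.words reads. That reader is set up for the Genesis corpus's own files; given a path outside the corpus it appears to hand undecoded bytes to the tokenizer, hence the TypeError. Read the file yourself and pass the resulting tokens to the finder: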

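from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Read the file as text and split on whitespace to get a list of str tokens.
# encoding='utf-8' is an assumption; use the file's actual encoding.
with open('file.txt', 'r', encoding='utf-8') as f:
    tokens = f.read().split()

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)  # ignore bigrams seen fewer than 3 times
print(finder.nbest(bigram_measures.pmi, 5))  # top 5 bigrams by PMI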
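
The TypeError at the bottom of the traceback can be reproduced without NLTK, which may help show why reading the file as text avoids it. The sketch below is only an illustration: the pattern and the sample line are assumptions chosen for the demo, not taken from the asker's file or from NLTK's internals.

```python
import re

# A pattern compiled from a str, like the one a regexp tokenizer holds.
# (Illustrative pattern: runs of word characters, or runs of punctuation.)
word_pattern = re.compile(r"\w+|[^\w\s]+")

# Bytes, as a line looks when a file is read without decoding.
raw_line = b"In the beginning God created the heaven and the earth.\n"

try:
    word_pattern.findall(raw_line)  # str pattern applied to bytes
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Decoding first gives the pattern the str it expects.
# (utf-8 is an assumption about the file's encoding.)
tokens = word_pattern.findall(raw_line.decode("utf-8"))

print(error_message)
print(tokens[:4])
```

Opening the file in text mode, as in the answer above, does this decoding for you, so every token handed to the finder is already a str.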
