首页 > 解决方案 > 为什么填充词汇的困惑对于 nltk.lm bigram 来说是不定式的?

问题描述

我正在测试perplexity文本语言模型的度量:

  train_sentences = nltk.sent_tokenize(train_text)
  test_sentences = nltk.sent_tokenize(test_text)

  train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

  test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

  from nltk.lm.preprocessing import padded_everygram_pipeline
  from nltk.lm import MLE,Laplace
  from nltk.lm import Vocabulary

  vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text),1);

  n = 2
  print(train_tokenized_text)
  print(len(train_tokenized_text))
  train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)

  # print(list(vocab),"\n >>>>",list(padded_vocab))
  model = MLE(n) # Lets train a 3-grams maximum likelihood estimation model.
  # model.fit(train_data, padded_vocab)
  model.fit(train_data, vocab)

  sentences = test_sentences
  print("len: ",len(sentences))
  print("per all", model.perplexity(test_text)) 

当我vocabmodel.fit(train_data, vocab)困惑中使用的print("per all", model.perplexity(test_text))是一个数字(30.2),但如果我使用padded_vocabwhich 有附加<s>并且</s>它打印inf.

标签: pythonnltk

解决方案


困惑的输入是 ngram 中的文本,而不是字符串列表。您可以通过运行来验证相同的

for x in test_text:
    print ([((ngram[-1], ngram[:-1]),model.score(ngram[-1], ngram[:-1])) for ngram in x])

您应该看到标记(ngrams)都是错误的。

如果您在测试数据中的单词超出(训练数据的)词汇,您仍然会感到困惑

train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)

train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']

train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]

test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE,Laplace
from nltk.lm import Vocabulary

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# fit on padded vocab that the model know the new tokens added to vocab (<s>, </s>, UNK etc)
model.fit(train_data, padded_vocab) 

test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# out of vocab test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))

推荐阅读