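python - Why is the perplexity infinite for an nltk.lm bigram model fitted with the padded vocabulary?
Question
I am testing the perplexity measure of a language model on some text: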
import nltk
from nltk.lm import MLE, Laplace, Vocabulary
from nltk.lm.preprocessing import padded_everygram_pipeline

# train_text and test_text are plain strings defined elsewhere.
train_sentences = nltk.sent_tokenize(train_text)
test_sentences = nltk.sent_tokenize(test_text)
train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

vocab = Vocabulary(nltk.tokenize.word_tokenize(train_text), 1)
n = 2
print(train_tokenized_text)
print(len(train_tokenized_text))
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
# print(list(vocab), "\n >>>>", list(padded_vocab))
model = MLE(n)  # Train a bigram (n = 2) maximum likelihood estimation model.
# model.fit(train_data, padded_vocab)
model.fit(train_data, vocab)
sentences = test_sentences
print("len: ", len(sentences))
print("per all", model.perplexity(test_text))
When I use vocab in model.fit(train_data, vocab), the perplexity printed by print("per all", model.perplexity(test_text)) is a number (30.2), but if I use padded_vocab, which additionally contains <s> and </s>, it prints inf.
Answer
The input to perplexity must be text in the form of ngrams, not a string or a list of strings. You can verify this by running:
for x in test_text:
    print([((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in x])
You should see that the tokens (ngrams) are all wrong.
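To make the problem concrete, here is a minimal sketch that needs no NLTK. It assumes test_text is a plain string, as in the question; the sample sentence and the names pairs and bigrams are illustrative only. Iterating over a string yields single characters, so each "ngram" becomes one character with an empty context, whereas a bigram model expects tuples of tokens (padding symbols are omitted here for brevity).

```python
# Assumption: test_text is a plain string, as in the question.
test_text = "an apple"

# Iterating over a string yields characters, so every "ngram" is one
# character and its context (ngram[:-1]) is the empty string.
pairs = [(ngram[-1], ngram[:-1]) for ngram in test_text]
print(pairs[:3])

# What a bigram model expects instead: tuples of word tokens.
tokens = test_text.split()
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```

The first list contains character/empty-context pairs, while the second contains the word-level bigram that the model was actually trained on.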
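Even with correctly formed ngrams, you will still get an infinite perplexity (inf) if the test data contains words that are outside the vocabulary of the training data: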
import nltk
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

train_sentences = ['an apple', 'an orange']
test_sentences = ['an apple']
train_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                        for sent in train_sentences]
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]

n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, train_tokenized_text)
model = MLE(n)
# Fit on the padded vocabulary so that the model knows the padding tokens (<s>, </s>).
model.fit(train_data, padded_vocab)

# Convert the test sentences to ngrams before computing perplexity.
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all", model.perplexity(test))

# Out-of-vocabulary test data
test_sentences = ['an ant']
test_tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent)))
                       for sent in test_sentences]
test_data, _ = padded_everygram_pipeline(n, test_tokenized_text)
for test in test_data:
    print("per all [oov]", model.perplexity(test))
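The reason a single unseen word is enough to produce inf can be sketched by hand. The helper below is hypothetical (it is not part of NLTK) and assumes the usual definition of perplexity as 2 raised to the average negative log2 probability, which, as far as I can tell, is also what nltk.lm uses. The probabilities are worked out manually for the toy corpus above: with an unsmoothed unigram MLE model, 'an' has probability 2/4, 'apple' has 1/4, and the unseen word 'ant' has 0.

```python
import math

def perplexity(probabilities):
    """Hypothetical helper: 2 ** cross-entropy of per-ngram probabilities."""
    # A zero probability has a log score of negative infinity.
    log_scores = [math.log2(p) if p > 0 else float("-inf") for p in probabilities]
    entropy = -sum(log_scores) / len(log_scores)
    return 2.0 ** entropy

# 'an apple': both words were seen in training.
in_vocab = perplexity([0.5, 0.25])
# 'an ant': 'ant' was never seen, so an unsmoothed MLE model assigns it 0.
with_oov = perplexity([0.5, 0.0])

print(in_vocab)
print(with_oov)
```

One zero-probability ngram drives the average log score to negative infinity, so the perplexity is infinite regardless of how well the rest of the text is modelled. A smoothed model such as Laplace, which the question already imports, avoids zero probabilities and should therefore give a finite value.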