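python - List comprehension over tokenised sentences
Question
I wrote this code to measure, on a large corpus sample, how much the vocabulary size shrinks when number and case normalisation are applied.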
def vocabulary_size(sentences):
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts.keys())
rcr = ReutersCorpusReader()
sample_size = 10000
raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences] # something going wrong here
raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))
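As it stands, though, it just prints each individual character as is. I think I have narrowed the problem down to two lines. A list has no attribute .lower(), so I'm not sure what to replace it with.
I also think I may have to feed lowered_sentences into normalised_sentences.
Here is my normalisation function: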
def normalise(token):
    print(["NUM" if token.isdigit()
           else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token)
           else token for token in token])
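Though I may not even be meant to use this particular normalisation function. Maybe I'm attacking this the wrong way; apologies, I will come back with more information.
Solution
I see a few things that should clear this up for you.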
lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
Here you forgot to actually use the right variable; you probably meant
lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
Also, because a list has no lower() method, you have to apply it to every token in every sentence, i.e.
lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
Also, your normalise(token) is not returning anything; it only calls print. So the list comprehension
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here
produces nothing but a list of None values.
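A minimal sketch of one way to fix this, assuming normalise is meant to receive one tokenised sentence (a list of strings) and hand back the normalised list. The parameter is renamed to sentence for clarity and the sample sentence is made up for illustration; neither comes from the original code.

```python
import re

def normalise(sentence):
    # Return the new list instead of printing it, so the caller
    # actually receives the normalised tokens.
    return [
        "NUM" if token.isdigit()
        else "Nth" if re.fullmatch(r"\d+(st|nd|rd|th)", token)
        else token
        for token in sentence
    ]

# Hypothetical sample standing in for the tokenised Reuters sentences.
tokenised_sentences = [["The", "3rd", "quarter", "saw", "42", "deals"]]
normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]
```

With return in place, normalised_sentences holds lists of tokens that vocabulary_size can count.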
I'd suggest holding off on list comprehensions and starting with ordinary for loops until your algorithm is in place, then converting it later if speed is needed.
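As a rough illustration of the plain-loop approach, the sketch below lowercases and normalises each token in explicit for loops. The helper names normalise_token and normalise_sentences and the sample data are hypothetical, and combining lowercasing with number normalisation in one pass is an assumption about what the asker wants.

```python
import re

def normalise_token(token):
    # Map plain digits to "NUM", ordinals such as "3rd" to "Nth",
    # and lowercase everything else.
    if token.isdigit():
        return "NUM"
    if re.fullmatch(r"\d+(st|nd|rd|th)", token):
        return "Nth"
    return token.lower()

def normalise_sentences(tokenised_sentences):
    # Explicit loops: one new list per sentence, one entry per token.
    result = []
    for sentence in tokenised_sentences:
        new_sentence = []
        for token in sentence:
            new_sentence.append(normalise_token(token))
        result.append(new_sentence)
    return result

def vocabulary_size(sentences):
    # Count distinct tokens across all sentences.
    seen = set()
    for sentence in sentences:
        for token in sentence:
            seen.add(token)
    return len(seen)

# Hypothetical sample in place of the corpus reader output.
sample = [
    ["The", "2nd", "report", "cited", "1987"],
    ["the", "3rd", "Report", "cited", "2004"],
]
raw_vocab_size = vocabulary_size(sample)
normalised_vocab_size = vocabulary_size(normalise_sentences(sample))
```

Once this behaves as expected, the two nested loops in normalise_sentences can be collapsed into a nested list comprehension.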