Tokenisation list comprehension

Problem description

I created this code to determine, using a large sample from a corpus, how much the vocabulary size is reduced when number and case normalisation are applied.

import re
from nltk import word_tokenize  # assuming NLTK's tokeniser

def vocabulary_size(sentences):
    # Count every token occurrence, then report the number of distinct tokens.
    tok_counts = {}
    for sentence in sentences:
        for token in sentence:
            tok_counts[token] = tok_counts.get(token, 0) + 1
    return len(tok_counts.keys())

rcr = ReutersCorpusReader()  # corpus reader defined elsewhere

sample_size = 10000

raw_sentences = rcr.sample_raw_sents(sample_size)
tokenised_sentences = [word_tokenize(sentence) for sentence in raw_sentences]

lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

raw_vocab_size = vocabulary_size(tokenised_sentences)
normalised_vocab_size = vocabulary_size(normalised_sentences)
print("Normalisation produced a {0:.2f}% reduction in vocabulary size from {1} to {2}".format(
    100*(raw_vocab_size - normalised_vocab_size)/raw_vocab_size,raw_vocab_size,normalised_vocab_size))

As it stands, though, it just prints each individual character as-is. I think I have localised the problem to two lines. A list has no attribute .lower(), so I'm not sure how to replace it.

I also think I may have to feed my lowered_sentences into my normalised_sentences.

Here is my normalisation function:

def normalise(token):
    print(["NUM" if token.isdigit()
           else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token)
           else token for token in token])

That said, I may not even end up using this particular normalisation function. Maybe I'm attacking this the wrong way; apologies, I'll come back with more information.

Tags: python, python-3.x, token, list-comprehension

Solution

I see a few things here that should sort out the problem for you.

 lowered_sentences = [tokenised_sentences.lower() for sentence in tokenised_sentences] # something going wrong here
 normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

Here you have forgotten to actually use the correct variable; you probably meant

 lowered_sentences = [sentence.lower() for sentence in tokenised_sentences]
 normalised_sentences = [normalise(sentence) for sentence in tokenised_sentences]

Also, because a list has no lower() method, you have to apply it to every token in every sentence, i.e.

 lowered_sentences = [[token.lower() for token in sentence] for sentence in tokenised_sentences]
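
You also suspected that you need to feed lowered_sentences into the normalisation step; once normalise actually returns a list (see the fix below), that chaining is simply:

 normalised_sentences = [normalise(sentence) for sentence in lowered_sentences]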

Also, your normalise(token) does not return anything; it only calls print. So the list comprehension

 normalised_sentences = [normalise(tokenised_sentences) for sentence in tokenised_sentences] # something going wrong here

does not produce a list of anything but None.
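
A minimal fix, keeping your NUM/Nth rules, is to return the comprehension instead of printing it. A sketch (note that it needs import re, and I've renamed the argument to tokens so it no longer shadows the loop variable):

 import re

 def normalise(tokens):
     # Map digit strings to "NUM" and ordinals like "1st"/"22nd" to "Nth";
     # every other token passes through unchanged.
     return ["NUM" if token.isdigit()
             else "Nth" if re.fullmatch(r"[\d]+(st|nd|rd|th)", token)
             else token
             for token in tokens]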

I'd suggest you refrain from using list comprehensions and start off with normal for loops until you have your algorithm in place; convert them later if speed is needed.
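
For instance, the combined lowering and normalising step written with plain loops (a sketch that assumes the returning normalise shown above) could be:

 normalised_sentences = []
 for sentence in tokenised_sentences:
     new_sentence = []
     for token in sentence:
         new_sentence.append(token.lower())  # case normalisation per token
     normalised_sentences.append(normalise(new_sentence))  # then number normalisation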

