How do I update a dictionary using an if statement?

Problem description

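I want to create a histogram: a dictionary that shows, for each word length, how many words of that length occur in the input text. So far I have managed to create a dictionary containing all possible word lengths, but I can't seem to update it. I am stuck on the following error. The full Python traceback: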

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-4ae1bb3ffd5e> in <module>
----> 1 text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")

<timed exec> in text2wordlengthPDF(text)

TypeError: cannot unpack non-iterable int object

My code looks like this:

def text2wordlengthPDF(text):
    '''Read in the text document `text`, tokenize it using re.split and regex \W+, and create 
    the histogram of wordlenghts using the Counter method. Return this histogram. 
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''

    #.read() is a way to retrieve strings from file object
    tokens = re.split(r'\W+', open(text, "r").read())
    tokens_counter = Counter(tokens)

    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter ]))

    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for k,v in dict_histogram.items():
        if (k == len(w) for w in tokens_counter):
            k[v] = +1
    dict_histogram 

    print(dict_histogram)

# run and plot    
#pdf= text2wordlengthPDF(linktopdf())
#pdfS= pd.Series(pdf).sort_index()

#pdfS[pdfS>5].plot(kind='bar' ) #plot only the wordlenghts occurring more then 5 times.
#print(pdf)

#This is where I run my code with the input text
text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt") 


Tags: python, python-3.x

Solution


This part

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        k[v] = +1

makes no sense. k (a word length) is not the dictionary itself, so it cannot be indexed. (Besides, you probably meant k[v] += 1.)
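
A minimal sketch of why the loop body fails, using a small made-up dictionary in place of dict_histogram. The second function is only an assumption about where the unpack error in the traceback might come from (iterating the dictionary without .items()); the code as posted does call .items(), so the message in the question may stem from a slightly different version of it.

```python
# Hypothetical stand-in for the question's dict_histogram.
dict_histogram = {3: 0, 5: 0}

def try_int_item_assignment():
    """Mimic `k[v] = +1` where k is an int key; return the error text."""
    for k, v in dict_histogram.items():
        try:
            k[v] = +1  # k is an int, not the dictionary
        except TypeError as exc:
            return str(exc)

def try_unpack_without_items():
    """Iterate the dict directly (keys only) while unpacking two names."""
    try:
        for k, v in dict_histogram:  # yields ints, which cannot be unpacked
            pass
    except TypeError as exc:
        return str(exc)

# `x = +1` assigns positive one; `y += 1` increments.
x = 10
x = +1
y = 10
y += 1

print(try_int_item_assignment())
print(try_unpack_without_items())
print(x, y)  # 1 11
```

Both failures are TypeErrors raised because an int is being treated as a container.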

Rewriting it more correctly gives

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        dict_histogram[k] += v

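but that does not work either. (I am actually surprised that the if line is not an outright syntax error. Is it valid syntax?) The value v still holds the original value stored under that key, which is 0 (from the initialization). You want len(w) there, but you cannot use it, because w is a variable local to the expression on the previous line.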
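
On the syntax question, a small sketch with made-up tokens: the parenthesized condition is a generator expression, which is valid Python, but a generator object is always truthy, so the if branch runs for every key. any() is what would actually test the values. The names and sample data below are illustrative only.

```python
from collections import Counter

# Hypothetical tokens standing in for the words read from the file.
tokens_counter = Counter(["a", "bb", "cc", "ddd"])
k = 7  # a length that no token has

gen = (k == len(w) for w in tokens_counter)

# A generator object is truthy regardless of what it would yield.
always_true = bool(gen)

# any() consumes the generator and tests the yielded values.
has_length_7 = any(k == len(w) for w in tokens_counter)
has_length_2 = any(2 == len(w) for w in tokens_counter)

# w exists only inside the generator expression, not in the enclosing scope.
try:
    w
    w_is_visible = True
except NameError:
    w_is_visible = False

print(always_true, has_length_7, has_length_2, w_is_visible)
# True False True False
```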

Rewriting that part completely led me to this:

import pprint
import re
from collections import Counter

def text2wordlengthPDF(text):
    r'''Read in the text document `text`, tokenize it using re.split and regex \W+, and create 
    the histogram of word lengths using the Counter method. Return this histogram. 
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''

    # Read the whole file as one string; the with block closes it afterwards
    with open(text, "r", encoding="utf8") as infile:
        tokens = re.split(r'\W+', infile.read())
    tokens_counter = Counter(tokens)

    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter ]))

    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for key,occurrence in tokens_counter.items():
        dict_histogram[len(key)] += occurrence

    pprint.pprint(sorted(dict_histogram.items()), compact=True)

text2wordlengthPDF("pslrm.txt") 

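Each key taken from the Counter is a word, so its value tokens_counter[key] is the number of times that word occurs. Both can be iterated over together with the Counter's items() method.
That count is then added to the dictionary, which is indexed by word length. Finally, sorted lists the lengths that occur in ascending order: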

[(0, 2), (1, 57262), (2, 54080), (3, 95251), (4, 132448), (5, 29969),
 (6, 62938), (7, 46593), (8, 23929), (9, 14645), (10, 12943), (11, 10708),
 (12, 2940), (13, 2742), (14, 1807), (15, 827), (16, 312), (17, 17965),
 (18, 91), (19, 118), (20, 147), (21, 24), (22, 35), (23, 7), (24, 13), (25, 1),
 (26, 24), (28, 1), (29, 24), (34, 1)]

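(That 34-character "word" happens to be a random hex string in my test corpus, the PostScript Language Reference Manual: 4c47494b4d4c524c4d50535051554c5152. The other overly long words are just as disappointing, alas.)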
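
As a possible shortcut, the same kind of histogram can be built by counting token lengths directly with a Counter. This is only a sketch: the helper name wordlength_histogram is made up, and it takes a string rather than a filename so that it runs without a corpus file.

```python
import re
from collections import Counter

def wordlength_histogram(text):
    """Return {word length: number of tokens} for a string (hypothetical helper)."""
    tokens = re.split(r'\W+', text)
    # Count the length of every token, then sort by length for readability.
    return dict(sorted(Counter(len(t) for t in tokens).items()))

sample = "the cat sat on the mat."
print(wordlength_histogram(sample))  # {0: 1, 2: 1, 3: 5}
```

The length-0 entry in this sample comes from the empty string that re.split produces after the final period; filtering out empty tokens before counting would remove it.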

