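python - How do I update a dictionary using an if statement?
Problem description
I want to create a histogram: a dict showing, for each word length, how many words of that length occur in the input text. So far I have managed to create a dictionary containing all possible word lengths, but I can't seem to update the dictionary. I'm stuck on an error. Full Python traceback: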
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-36-4ae1bb3ffd5e> in <module>
----> 1 text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")
<timed exec> in text2wordlengthPDF(text)
TypeError: cannot unpack non-iterable int object
My code looks like this:
def text2wordlengthPDF(text):
    '''Read in the text document `text`, tokenize it using re.split and regex \W+, and create
    the histogram of wordlenghts using the Counter method. Return this histogram.
    The histogram is a dict showing for each wordlength how many words with that length are in the input text.'''
    #.read() is a way to retrieve strings from file object
    tokens = re.split(r'\W+', open(text, "r").read())
    tokens_counter = Counter(tokens)
    # create list of wordlength for items in Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter ]))
    # Create dictionary with wordlength as key and occurrence as value
    dict_histogram = {i:0 for i in wordlength}
    for k,v in dict_histogram.items():
        if (k == len(w) for w in tokens_counter):
            k[v] = +1
    dict_histogram
    print(dict_histogram)

# run and plot
#pdf= text2wordlengthPDF(linktopdf())
#pdfS= pd.Series(pdf).sort_index()
#pdfS[pdfS>5].plot(kind='bar' ) #plot only the wordlenghts occurring more then 5 times.
#print(pdf)

#This is where I run my code with the input text
text2wordlengthPDF("R095-Big-data-vrije-veilige-samenleving.txt")
Solution
This part

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        k[v] = +1

makes no sense. `k`, the key (the length of each word), is not the dictionary. (Also, you probably meant `k[v] += 1`.)
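A small sketch of both problems, using a hypothetical two-entry dictionary `counts` (not from the question): `= +1` assigns positive one instead of incrementing, and indexing the key fails because the key is a plain `int`.

```python
counts = {3: 0, 5: 0}  # hypothetical histogram: word length -> count

# `= +1` assigns positive one each time; it does not increment.
counts[3] = +1
counts[3] = +1

# `+= 1` increments the value stored under the key.
counts[5] += 1
counts[5] += 1

# Indexing the key itself fails, because the key is an int, not the dict.
key = 3
error_message = ""
try:
    key[0] = +1
except TypeError as exc:
    error_message = str(exc)

print(counts)
print(error_message)
```

The update has to go through the dictionary (`counts[key]`), never through the key.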
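Rewriting it more correctly leads to

for k,v in dict_histogram.items():
    if (k == len(w) for w in tokens_counter):
        dict_histogram[k] += v

but this doesn't work either. (I'm actually surprised that the `if` line isn't an outright syntax error. Is it valid syntax?) The value `v` still holds the original value for that key, which is `0` (from the initialization). You want `len(w)` there, but you can't use it, because `w` is only a variable local to the expression on the previous line.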
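On the question of whether that `if` line is valid syntax: it is. Parentheses around `expr for x in iterable` build a generator expression, and a generator object is always truthy, so the condition never filters anything. A minimal sketch, using a hypothetical token list `words`:

```python
words = ["a", "bb", "ccc"]  # hypothetical tokens, lengths 1, 2 and 3

# This creates a generator object; no comparison is evaluated yet.
condition = (5 == len(w) for w in words)

# A generator object is truthy, even though no word has length 5.
always_true = bool(condition)

# any() consumes the generator and actually evaluates the comparisons.
has_length_5 = any(5 == len(w) for w in words)
has_length_2 = any(2 == len(w) for w in words)

print(always_true, has_length_5, has_length_2)
```

If a membership test were really needed, `any(...)` would be the way to express it, but the rewrite in this answer avoids the test altogether.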
Rewriting that part completely led me to this:

import pprint
import re
from collections import Counter

def text2wordlengthPDF(text):
    r'''Read in the text document `text`, tokenize it using re.split and regex \W+, and create
    the histogram of word lengths using the Counter method. Return this histogram.
    The histogram is a dict showing for each word length how many words with that length are in the input text.'''
    # .read() retrieves the file contents as a single string
    with open(text, "r", encoding="utf8") as f:
        tokens = re.split(r'\W+', f.read())
    tokens_counter = Counter(tokens)
    # create list of distinct word lengths for the items in the Counter
    wordlength = list(dict.fromkeys([len(w) for w in tokens_counter]))
    # create dictionary with word length as key and occurrence count as value
    dict_histogram = {i: 0 for i in wordlength}
    # add each word's number of occurrences to the bucket for its length
    for key, occurrence in tokens_counter.items():
        dict_histogram[len(key)] += occurrence
    pprint.pprint(sorted(dict_histogram.items()), compact=True)
    return dict_histogram

text2wordlengthPDF("pslrm.txt")
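Here `key` is a word taken from the counter, and its value `tokens_counter[key]` is the number of times that word occurs. Both can be iterated over together through the counter's `items()` method.
That number is then added to the dictionary entry indexed by the word's length. Finally, `sorted` lists the lengths that occur in ascending order: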
[(0, 2), (1, 57262), (2, 54080), (3, 95251), (4, 132448), (5, 29969),
(6, 62938), (7, 46593), (8, 23929), (9, 14645), (10, 12943), (11, 10708),
(12, 2940), (13, 2742), (14, 1807), (15, 827), (16, 312), (17, 17965),
(18, 91), (19, 118), (20, 147), (21, 24), (22, 35), (23, 7), (24, 13), (25, 1),
(26, 24), (28, 1), (29, 24), (34, 1)]
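(That 34-character "word" happens to be a random hex string in my test corpus, the PostScript Language Reference Manual: 4c47494b4d4c524c4d50535051554c5152. The other overlong words are just as disappointing, alas.)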
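As a more compact variant, the lengths can be counted directly with `Counter`, without pre-building a dictionary of zeros. This is only a sketch: the function name `word_length_histogram` and the sample sentence are made up for illustration, and it takes the text as a string rather than a file name. It also drops empty tokens, which is probably where the `(0, 2)` entry above comes from, since `re.split` yields an empty string when the text starts or ends with a non-word character.

```python
import re
from collections import Counter

def word_length_histogram(content):
    """Map each word length to the number of words of that length in `content`."""
    # Drop the empty strings that re.split yields at the edges of the text.
    tokens = [t for t in re.split(r"\W+", content) if t]
    # Count the lengths directly, then order the result by length.
    return dict(sorted(Counter(len(t) for t in tokens).items()))

sample = "Big data, free and safe society"  # hypothetical input text
histogram = word_length_histogram(sample)
print(histogram)  # {3: 2, 4: 3, 7: 1}
```

To use it on a file, read the contents first, for example with `open(path, encoding="utf8").read()`, and pass the resulting string in.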