Total number of occurrences of multiple words/values in a file

Problem description

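I have a file containing a large amount of text. I am reading this file and want to count how many times each Bible passage is referenced; the references appear on lines that start with "Verse". I then want to print each reference followed by its number of occurrences.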

Sample file:

Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke

The result should look like this:

{'5:2': 2, '10:5': 1, '3:16': 1}

I am using a dictionary whose key: value pairs are reference: number of occurrences. The script is short and is given below:

fileHandle = open("sj", "r")
occurrences = dict()
references = []
#Go through each line; if it is a verse line (starts with "Verse"), separate the reference and count the reference
for line in fileHandle:
    if "Verse" in line:
        verseLine = line.split()
        references.append(verseLine[2]) #Reference is always 3rd index
        for reference in references:
            if reference not in occurrences:
                occurrences[reference] = 1
            else:
                occurrences[reference] = occurrences[reference] + 1
print(" References printed below ")
print(references)
print(" Occurrences printed below ")
print(occurrences)

The problem: the references are counted in a strange way. This is my output:

{'5:2': 5, '10:5': 3, '3:16': 2}

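Clearly this is wrong! I think it has something to do with the else: statement. For example, if I change it to occurrences[reference] = occurrences[reference] + 2 (note the 1 changed to 2), I would expect the results to double. But they don't: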

{'5:2': 9, '10:5': 5, '3:16': 3}

Why is this count incorrect?

Tags: python, list, dictionary

Solution
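
The count in the question's script grows too fast because the inner loop, for reference in references, runs again on every verse line and re-counts every reference collected so far. For '5:2' that gives 1 after the first verse line, 2 after the second, 3 after the third, and 5 after the fourth (the list then contains '5:2' twice). Changing + 1 to + 2 does not double the result because the first sighting of a reference always sets the count to 1. Below is a minimal sketch of one possible fix that counts each reference once, on the line where it appears. It assumes the reference is always the third whitespace-separated token; count_references is a hypothetical helper name, and it uses startswith rather than the question's "Verse" in line test to match the description of lines starting with "Verse".

```python
def count_references(lines):
    """Count each verse reference exactly once per "Verse" line."""
    occurrences = {}
    for line in lines:
        if line.startswith("Verse"):
            # Assumption: the reference (e.g. "5:2") is the third token.
            reference = line.split()[2]
            # Count this line's reference only; do not loop over earlier ones.
            occurrences[reference] = occurrences.get(reference, 0) + 1
    return occurrences


sample = [
    "Verse- Matthew 5:2",
    "Commentary- Matthew",
    "Verse- Matthew 10:5",
    "Verse- John 3:16",
    "Commentary- John",
    "Verse- Luke 5:2",
    "Commentary- Luke",
]
print(count_references(sample))  # {'5:2': 2, '10:5': 1, '3:16': 1}
```

The separate references list is no longer needed for counting; it can be kept if the references should also be printed in order of appearance.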


Another version, using re and collections.Counter:

data = '''Verse- Matthew 5:2
Commentary- Matthew
Verse- Matthew 10:5
Verse- John 3:16
Commentary- John
Verse- Luke 5:2
Commentary- Luke'''

import re
from collections import Counter

# With re.M, ^ and $ match at each line; capture the chapter:verse at the end of every "Verse" line
c = Counter(re.findall(r'^Verse.*?(\d+:\d+)$', data, flags=re.M))
print(dict(c))

Prints:

{'5:2': 2, '10:5': 1, '3:16': 1}
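
The version above works on a string held in memory, while the question reads from a file. A rough sketch of how Counter could be fed line by line instead, without the regular expression, might look like this. The helper name count_verse_references is hypothetical, and it assumes every line starting with "Verse" has the reference as its third whitespace-separated token (a shorter line would raise IndexError).

```python
from collections import Counter


def count_verse_references(lines):
    """Return a Counter of references found on lines starting with "Verse"."""
    return Counter(
        line.split()[2]              # assumption: reference is the third token
        for line in lines
        if line.startswith("Verse")
    )


# With the question's file, this could be called as:
#     with open("sj") as file_handle:
#         counts = count_verse_references(file_handle)
sample_lines = [
    "Verse- Matthew 5:2",
    "Commentary- Matthew",
    "Verse- Matthew 10:5",
    "Verse- John 3:16",
    "Commentary- John",
    "Verse- Luke 5:2",
    "Commentary- Luke",
]
counts = count_verse_references(sample_lines)
print(dict(counts))
```

Because a file object yields one line at a time, this approach does not need to load the whole file into memory first.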
