My method of building a dictionary takes too long for large files. Is there a better way?

Problem description

def get_word_frequencys(words):
    """given a list of words, returns a dictionary of the words,
    and their frequencies"""

    words_and_freqs = {}
    for word in words:
        words_and_freqs[word] = words.count(word)
    return words_and_freqs

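The function above works fine for small files, but I need it to handle a file that is 264,505 words long, and at the moment my program takes several minutes to process a file of that size.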

How can I build the dictionary more efficiently?

All relevant code:

def main(words):
    """
    given lots of words do things
    """
    words_and_frequencys = get_word_frequencys(words)

    print("loaded ok.")
    print()
    print_max_frequency(words, words_and_frequencys)


def get_word_frequencys(words):
    """given a list of words, returns a dictionary of the words,
    and their frequencies"""
    words_and_freqs = {}
    for word in words:
        words_and_freqs[word] = words.count(word)
    return words_and_freqs      


def print_max_frequency(words, words_and_frequencys):
    """given a dict of words and their frequencies,
    prints the max frequency of any one word"""
    max_frequency = 0
    for word in words:
        if words_and_frequencys.get(word) > max_frequency:
            max_frequency = words_and_frequencys.get(word)
    print(" " + "Maximum frequency = {}".format(max_frequency)) 

A note to anyone suggesting Counter instead of count(): I am not allowed to import any modules other than os and re.

Tags: python, python-3.x

Solution


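Every call to count on the list walks the entire list, which takes O(N) time. Because you make that call once for every word in the list, the whole operation takes O(N**2) time. You can do much better.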

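Instead of counting how many times the word you just saw occurs elsewhere in the list, why not count only the one occurrence you are looking at in the current iteration? If you see more copies later, you update the count then. Since this does only a small, constant amount of work per word, the total running time is linear rather than quadratic.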

for word in words:
    words_and_freqs[word] = words_and_freqs.get(word, 0) + 1
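Wrapped up as a replacement for the function in the question, the loop above might look like the following sketch. The name get_word_frequencies_fast and the sample list are made up for illustration; nothing is imported, so it stays within the os/re restriction.

```python
def get_word_frequencies_fast(words):
    """Given a list of words, return a dict mapping each word to its count.

    Makes a single pass over the list, so the running time is linear.
    """
    words_and_freqs = {}  # must exist before the loop starts
    for word in words:
        # get() returns 0 the first time a word is seen
        words_and_freqs[word] = words_and_freqs.get(word, 0) + 1
    return words_and_freqs


sample = ["a", "b", "a", "c", "a", "b"]  # made-up data
freqs = get_word_frequencies_fast(sample)
print(freqs)  # {'a': 3, 'b': 2, 'c': 1}
```

Each word costs one dictionary lookup and one assignment, both of which take constant time on average, so a list of a few hundred thousand words should no longer need a quadratic number of comparisons.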

If you would rather not use dict.get, you can use an explicit if statement instead to check whether the current word has been seen before:

for word in words:
    if word in words_and_freqs:
        words_and_freqs[word] += 1
    else:
        words_and_freqs[word] = 1
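A related, smaller saving is available in print_max_frequency from the question: it loops over every word in the list, although the dictionary already holds exactly one count per distinct word. A minimal sketch, assuming the dictionary was built by one of the loops above (max_frequency is a hypothetical helper name, not part of the original code):

```python
def max_frequency(words_and_freqs):
    """Return the highest count stored in the dict, or 0 if it is empty."""
    # max() is a built-in, so no import is needed; default covers the empty dict
    return max(words_and_freqs.values(), default=0)


example = {"a": 3, "b": 2, "c": 1}  # made-up counts
print(" " + "Maximum frequency = {}".format(max_frequency(example)))
```

This visits each distinct word once rather than each occurrence, which matters less than fixing the quadratic count but keeps the second pass proportional to the vocabulary size.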
