Processing a large file with multiprocessing in Python: how to load a resource only once per process?

Problem

Python's `multiprocessing.Pool.imap` is very convenient for processing a large file line by line:

import multiprocessing

def process(line):
    processor = Processor('some-big.model') # this takes time to load...
    return processor.process(line)

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)

How can I make sure that the `Processor` helper in the example above is loaded only once? Is this possible at all without resorting to a more complicated/verbose structure involving queues?

Tags: python, multiprocessing

Solution


`multiprocessing.Pool` supports per-worker resource initialization via its `initializer` and `initargs` parameters. I was surprised to learn that the idea is to make use of a global variable, as illustrated below:

import multiprocessing as mp

def init_process(model):
    global processor
    processor = Processor(model) # this takes time to load...

def process(line):
    return processor.process(line) # via global variable `processor` defined in `init_process`

if __name__ == '__main__':
    pool = mp.Pool(4, initializer=init_process, initargs=['some-big.model'])
    with open('lines.txt') as infile, open('processed-lines.txt', 'w') as outfile:
        for processed_line in pool.imap(process, infile):
            outfile.write(processed_line)

This concept isn't described well in the `multiprocessing.Pool` documentation, so I hope this example helps others.
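For reference, here is a self-contained sketch of the same initializer pattern that can be run as-is. The `Processor` class here is a hypothetical stand-in for the expensive-to-load helper from the question; only its `process` method matters for the illustration:

```python
import multiprocessing as mp

class Processor:
    """Stand-in for the expensive-to-load helper (hypothetical)."""
    def __init__(self, model):
        self.model = model  # imagine this load takes a long time

    def process(self, line):
        return line.upper()  # placeholder for the real per-line work

def init_process(model):
    # Runs exactly once in each worker process when the pool starts.
    global processor
    processor = Processor(model)

def process(line):
    # Reuses the worker-global `processor` instead of reloading it per line.
    return processor.process(line)

def run(lines, workers=2):
    with mp.Pool(workers, initializer=init_process,
                 initargs=('some-big.model',)) as pool:
        return list(pool.imap(process, lines))

if __name__ == '__main__':
    print(run(['hello\n', 'world\n']))
```

Note that `initargs` must be an iterable of arguments; each element is passed positionally to the `initializer` in every worker.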

