Python - Read from multiple big files in parallel and yield their lines individually

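Question

I have multiple big files and need to yield their lines one at a time in round-robin fashion. Something like this pseudocode: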

    def get(self):
        with open(file_list, "r") as files:
            for file in files:
                yield file.readline()

How can I do this?

Tags: python, file-io, bigdata

Answer


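The itertools documentation has several recipes, one of which is a very neat round-robin recipe. I would also use ExitStack to manage the context managers of the multiple files: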

from itertools import cycle, islice
from contextlib import ExitStack

# https://docs.python.org/3.8/library/itertools.html#itertools-recipes
def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # Recipe credited to George Sakkis
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next in nexts:
                yield next()
        except StopIteration:
            # Remove the iterator we just exhausted from the cycle.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))

...

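def get(self):
    # files_list is assumed to be the path of a text file
    # that holds one filename per line.
    with open(files_list) as fl:
        filenames = [x.strip() for x in fl]
    # ExitStack closes every opened file once the generator finishes.
    with ExitStack() as stack:
        files = [stack.enter_context(open(fname)) for fname in filenames]
        yield from roundrobin(*files)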
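
To see how the recipe behaves when the inputs have different lengths, here is a minimal sketch. It uses in-memory io.StringIO objects as stand-ins for real files (an assumption made only for illustration; the names a, b and c are hypothetical), and it repeats the recipe so the snippet runs on its own:

```python
import io
from itertools import cycle, islice


def roundrobin(*iterables):
    """Same itertools recipe as above, repeated so this sketch is self-contained."""
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next_item in nexts:
                yield next_item()
        except StopIteration:
            # Drop the iterator that just ran out and keep cycling the rest.
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))


# Hypothetical in-memory "files" with 3, 1 and 2 lines respectively.
a = io.StringIO("a1\na2\na3\n")
b = io.StringIO("b1\n")
c = io.StringIO("c1\nc2\n")

# File objects yield lines with their trailing newline, so strip it here.
lines = [line.rstrip("\n") for line in roundrobin(a, b, c)]
print(lines)  # ['a1', 'b1', 'c1', 'a2', 'c2', 'a3']
```

Shorter inputs simply drop out of the rotation once they are exhausted, while the remaining ones keep alternating.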

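That said, perhaps the best design is to use inversion of control and pass the sequence of file objects as an argument to .get, so that the calling code takes care of using the exit stack: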

class Foo:
    ...
    def get(self, files):
        yield from roundrobin(*files)

# calling code:
foo = Foo() # or however it is initialized

with open(files_list) as fl:
    filenames = [x.strip() for x in fl]
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    for line in foo.get(files):
        do_something_with_line(line)
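
One consequence of letting the caller own the ExitStack is that the files get closed even when the loop stops early. The following is a rough sketch of that behavior, not a definitive implementation: the temporary files a.txt and b.txt, their contents, and the names seen and all_closed are all hypothetical, and the recipe is repeated so the snippet is self-contained:

```python
import os
import tempfile
from contextlib import ExitStack
from itertools import cycle, islice


def roundrobin(*iterables):
    """The itertools round-robin recipe, repeated so this sketch runs on its own."""
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for next_item in nexts:
                yield next_item()
        except StopIteration:
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))


with tempfile.TemporaryDirectory() as tmp:
    # Create two small hypothetical input files.
    filenames = []
    for name, text in [("a.txt", "a1\na2\n"), ("b.txt", "b1\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as fh:
            fh.write(text)
        filenames.append(path)

    seen = []
    with ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in filenames]
        for line in roundrobin(*files):
            seen.append(line.strip())
            if len(seen) == 2:
                break  # stop before the files are exhausted

    # Leaving the with block closed every file, despite the early break.
    all_closed = all(f.closed for f in files)

print(seen, all_closed)  # ['a1', 'b1'] True
```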
