Matching strings across two CSV files when the second file is too large to read into a list

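Problem description

The code below works for files of up to 3 million records, but beyond that I run out of memory, because I read the data into lists and then loop over the lists to find matches.

From earlier posts I gathered that I should process one row at a time in a loop, but I could not find any post on how to take one row at a time from a CSV file and run it through two nested loops, as in the code below.

Any help would be greatly appreciated. Thank you in advance.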

import csv

# open two csv files and read into lists lsts and lstl
with open('small.csv') as s:
    sml = csv.reader(s)
    lsts = [tuple(row) for row in sml]

with open('large.csv') as l:
    lrg = csv.reader(l)
    lstl = [tuple(row) for row in lrg]  # can be too large for memory

# find a match and print 
for rows in lsts:
    for rowl in lstl:

        if rowl[7] != rows[0]: # if no match continue
            continue
        else: 
            print(rowl[7], rowl[2]) # when matched print data required from large file

Tags: python, csv

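Solution

Assuming you are only interested in one column of the small CSV, you can turn that column into a set and compare it against the large CSV row by row. The set membership test completely replaces the outer loop, and the large file never has to be held in memory: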

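import csv

# Collect the key column of the small file into a set for fast membership tests.
with open('small.csv', newline='') as s:
    sml = csv.reader(s)
    sml_set = set(row[0] for row in sml)

# Stream the large file row by row so it never has to fit in memory.
with open('large.csv', newline='') as l:
    lrg = csv.reader(l)
    for row in lrg:
        if row[7] in sml_set:
            print(row[7], row[2])  # data required from the large file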

You can also turn the set-based loop into a generator, for example:

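import csv

def row_matches():
    # Build the lookup set from the first column of the small file.
    with open('small.csv', newline='') as s:
        sml = csv.reader(s)
        sml_set = set(row[0] for row in sml)

    # Yield matches lazily while streaming the large file.
    with open('large.csv', newline='') as l:
        lrg = csv.reader(l)
        for row in lrg:
            if row[7] in sml_set:
                yield row[7], row[2]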
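
A more reusable variant of the generator idea might take the file paths and column indexes as parameters. The sketch below is only an illustration. The name iter_matches, its parameters, the length guard that skips rows too short to contain the needed columns, and the demo helper with its made-up sample data are all assumptions, not part of the original answer.

```python
import csv
import os
import tempfile


def iter_matches(small_path, large_path, small_col=0, large_col=7, out_cols=(7, 2)):
    """Yield selected columns of each large-file row whose key is in the small file."""
    # Only the key column of the small file is kept in memory.
    with open(small_path, newline='') as s:
        keys = {row[small_col] for row in csv.reader(s) if len(row) > small_col}

    # Highest column index we will read; shorter rows are skipped (an assumption).
    needed = max((large_col, *out_cols))

    # The large file is streamed one row at a time and never stored in full.
    with open(large_path, newline='') as l:
        for row in csv.reader(l):
            if len(row) > needed and row[large_col] in keys:
                yield tuple(row[i] for i in out_cols)


def demo():
    """Write two tiny sample files to a temporary directory and return the matches."""
    with tempfile.TemporaryDirectory() as tmp:
        small_path = os.path.join(tmp, 'small.csv')
        large_path = os.path.join(tmp, 'large.csv')
        with open(small_path, 'w', newline='') as f:
            csv.writer(f).writerows([['k1'], ['k3']])
        with open(large_path, 'w', newline='') as f:
            csv.writer(f).writerows([
                ['a', 'b', 'v1', 'd', 'e', 'f', 'g', 'k1'],
                ['a', 'b', 'v2', 'd', 'e', 'f', 'g', 'k2'],
                ['a', 'b', 'v3', 'd', 'e', 'f', 'g', 'k3'],
                ['short', 'row'],
            ])
        # Consume the generator while the temporary files still exist.
        return list(iter_matches(small_path, large_path))
```

With real data you would iterate directly, for example `for key, value in iter_matches('small.csv', 'large.csv'): print(key, value)`, so that only one row of the large file is in memory at any time.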
