Parsing a text file as fast as possible

Problem description

I have a very large file containing lines like the following:

...
0.040027 a b c d e 12 34 56 78 90 12 34 56
0.050027 f g h i l 12 34 56 78 90 12 34 56
0.060027 a b c d e 12 34 56 78 90 12 34 56
0.070027 f g h i l 12 34 56 78 90 12 34 56
0.080027 a b c d e 12 34 56 78 90 12 34 56
0.090027 f g h i l 12 34 56 78 90 12 34 56
...

I need to build a dictionary like the one shown further down, as quickly as possible.

I am using the following code:

import time

ascFile = open('C:\\example.txt', 'r', encoding='UTF-8')

# Tags are padded with spaces so that they only match whole words
tag1 = ' a b c d e '
tag2 = ' f g h i l '
tags = [tag1, tag2]

temp = {'k1':[], 'k2':[]}
key_tag = {'k1':tag1, 'k2':tag2 }

t1 = time.time()
for line in ascFile:
    for path, tag in key_tag.items():
        if tag in line:
            # Text before the tag is the timestamp, text after it is the payload
            columns = line.strip().split(tag, 1)
            temp[path].append([columns[0], columns[-1].replace(' ', '')])
t2 = time.time()
print(t2-t1)

This parses a 360 MB file in 6 seconds and gives the following result. I would like to improve on that time.

temp = {
    'k1': [['0.040027', '1234567890123456'], ['0.060027', '1234567890123456'], ['0.080027', '1234567890123456']],
    'k2': [['0.050027', '1234567890123456'], ['0.070027', '1234567890123456'], ['0.090027', '1234567890123456']],
}

Tags: python, file, parsing, text

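Solution

I assume the key in each line is made up of a fixed number of words. Use split() to break up the line, then build the key directly from the resulting list: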

import collections

# raw strings don't need \\ for backslash:
FILESPEC = r'C:\example.txt'

lines_by_key = collections.defaultdict(list)

with open(FILESPEC, 'r', encoding='UTF-8') as f:
    for line in f:
        cols = line.split()
        key = ' '.join(cols[1:6])
        pair = (cols[0], ''.join(cols[6:]))  # tuple, not list, could be changed
        lines_by_key[key].append(pair)

print(lines_by_key)
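
Note that the code above keys the result by the tag text itself ('a b c d e') rather than by 'k1' and 'k2'. As a minimal sketch of how the same split-based idea could produce the dictionary shape from the question, the example below maps tags to keys with a lookup table. The names TAG_TO_KEY and parse_lines are hypothetical, and the sketch assumes every tag is exactly five whitespace-separated words, as in tag1 and tag2.

```python
import collections

# Hypothetical lookup table: tag words (without padding spaces) -> result key.
TAG_TO_KEY = {'a b c d e': 'k1', 'f g h i l': 'k2'}


def parse_lines(lines, tag_to_key=TAG_TO_KEY, tag_words=5):
    """Group lines by tag, splitting each line only once.

    Assumes the layout: <timestamp> <tag_words words of tag> <payload words>.
    Lines whose tag is not in tag_to_key (including blank lines) are skipped.
    """
    result = collections.defaultdict(list)
    end = 1 + tag_words  # index of the first payload column
    for line in lines:
        cols = line.split()
        key = tag_to_key.get(' '.join(cols[1:end]))
        if key is None:
            continue
        # Same pair layout as in the question: [timestamp, payload without spaces]
        result[key].append([cols[0], ''.join(cols[end:])])
    return dict(result)


# In-memory stand-in for the file, using three of the sample lines.
sample = [
    '0.040027 a b c d e 12 34 56 78 90 12 34 56\n',
    '0.050027 f g h i l 12 34 56 78 90 12 34 56\n',
    '0.060027 a b c d e 12 34 56 78 90 12 34 56\n',
]
parsed = parse_lines(sample)
print(parsed)
```

Because a file object is also an iterable of lines, it can be passed straight to parse_lines inside a with open(...) block. Whether this is actually faster than the original loop depends on the data, so it is worth timing both on the real file.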
