Is there a faster way than pickle or a regular Python file to store a big dictionary?

Question

I want to store a dictionary that contains only data in the following format:

{
    "key1" : True,
    "key2" : True,
    .....
}

In other words, it's just a quick way of checking whether a key is valid. I could do this by storing a dict called foo in a file called bar.py, and then importing it in my other modules as follows:

from bar import foo
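
For concreteness, bar.py would contain nothing but the dict literal itself (a minimal sketch using the names assumed in the question):

# bar.py -- a plain module whose only job is to hold the lookup dict
foo = {
    "key1": True,
    "key2": True,
}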

Alternatively, I could save it in a pickle file called bar.pickle and import it at the top of my file like this:

import pickle  
with open('bar.pickle', 'rb') as f:
    foo = pickle.load(f)
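
Either way, the validity check itself is just a dict membership test:

# True if the key exists, regardless of how foo was loaded
is_valid = 'key1' in foo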

Which approach is ideal and faster?

Tags: python, pickle

Solution

To add to @scnerd's comment, here are IPython timings for the different load scenarios.

Here we create a dictionary and write it out in three formats:

import random
import json
import pickle

letters = 'abcdefghijklmnopqrstuvwxyz'
d = {''.join(random.choices(letters, k=6)): random.choice([True, False]) 
     for _ in range(100000)}

# write a python file; a trailing comma before '}' is valid Python,
# so no dummy entry is needed to close the dict literal
with open('mydict.py', 'w') as fp:
    fp.write('d = {\n')
    for k, v in d.items():
        fp.write(f"    '{k}': {v},\n")
    fp.write('}\n')

# write a pickle file
with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp)

# write a json file (json.dump writes str, so open in text mode)
with open('mydict.json', 'w') as fp:
    json.dump(d, fp)
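
Before timing anything, a quick sanity check that the pickle and JSON files round-trip to an equal dict (importing mydict here is deliberately avoided, so the first-import timing below still runs cold):

with open('mydict.pickle', 'rb') as fp:
    assert pickle.load(fp) == d
with open('mydict.json') as fp:
    assert json.load(fp) == d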

The Python file:

%%timeit -n1 -r1
# on the first import, the source is compiled and the bytecode cached in __pycache__
from mydict import d

644 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit
# later imports hit the in-process sys.modules cache (backed on disk
# by __pycache__), so they are MUCH faster
from mydict import d

1.37 µs ± 54.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
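
To reproduce the cold-import number in the same session, both caches have to be cleared first (a sketch; it assumes mydict.py sits in the working directory, so its bytecode lands in ./__pycache__):

import shutil
import sys
import importlib

# forget the in-process module object and the on-disk bytecode,
# so the next `from mydict import d` is a genuinely cold import
sys.modules.pop('mydict', None)
shutil.rmtree('__pycache__', ignore_errors=True)
importlib.invalidate_caches()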

The pickle file:

%%timeit
with open('mydict.pickle', 'rb') as fp:
    pickle.load(fp)

52.4 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
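
One caveat: the dump above used the default pickle protocol. Dumping with the highest protocol usually gives a smaller file that also loads somewhat faster, though the exact gain depends on the data and Python version:

# re-dump with the newest protocol; typically smaller and quicker to load
with open('mydict.pickle', 'wb') as fp:
    pickle.dump(d, fp, protocol=pickle.HIGHEST_PROTOCOL)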

The JSON file:

%%timeit
with open('mydict.json', 'rb') as fp:
    json.load(fp)

81.3 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# here is the same test with ujson
import ujson

%%timeit
with open('mydict.json', 'rb') as fp:
    ujson.load(fp)

51.2 ms ± 304 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
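
Finally, since the question only needs to know whether a key is valid (every value is True), a pickled set of the keys carries the same information in a smaller file; a sketch, not part of the timings above:

# for an all-True dict, the key set is equivalent for membership checks
valid_keys = set(d)   # use {k for k, v in d.items() if v} if values vary

with open('mykeys.pickle', 'wb') as fp:
    pickle.dump(valid_keys, fp, protocol=pickle.HIGHEST_PROTOCOL)

with open('mykeys.pickle', 'rb') as fp:
    valid_keys = pickle.load(fp)
is_valid = 'key1' in valid_keys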
