python - How to compute several hashes at the same time?
Question
I want to compute multiple hashes of the same file and save time by multiprocessing.
From what I see, reading a file from an SSD is relatively fast, but hash computation is almost 4 times slower. If I want to compute 2 different hashes (MD5 and SHA), it's 8 times slower. I'd like to be able to compute the different hashes on different processor cores in parallel (up to 4, depending on the settings), but I don't understand how I can get around the GIL.
Here is my current code (hash.py
):
import hashlib
from io import DEFAULT_BUFFER_SIZE

file = 'test/file.mov'  # 50 MB file

def hash_md5(file):
    md5 = hashlib.md5()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            md5.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return md5.hexdigest()

def hash_sha(file):
    sha = hashlib.sha1()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            sha.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return sha.hexdigest()

def hash_md5_sha(file):
    md5 = hashlib.md5()
    sha = hashlib.sha1()
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            md5.update(chunk)
            sha.update(chunk)
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return md5.hexdigest(), sha.hexdigest()

def read_file(file):
    with open(file, mode='rb') as fl:
        chunk = fl.read(DEFAULT_BUFFER_SIZE)
        while chunk:
            chunk = fl.read(DEFAULT_BUFFER_SIZE)
    return
I did some tests and here are the results:
>>> from hash import *
>>> from timeit import timeit
>>> timeit(stmt='read_file(file)', globals=globals(), number=100)
1.6323043460000122
>>> timeit(stmt='hash_md5(file)', globals=globals(), number=100)
8.137973076999998
>>> timeit(stmt='hash_sha(file)', globals=globals(), number=100)
7.1260356809999905
>>> timeit(stmt='hash_md5_sha(file)', globals=globals(), number=100)
13.740918666999988
This should end up as a function: the main script will iterate through a file list and check different hashes for different files (from 1 to 4 per file). Any ideas how I can achieve that?
Solution
As said in the comments, you can use concurrent.futures
. I ran a few benchmarks, and the most efficient way to do it was using ProcessPoolExecutor
. Here is an example:
from concurrent.futures import ProcessPoolExecutor

executor = ProcessPoolExecutor(4)
executor.map(hash_function, files)
executor.shutdown()
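To make this concrete, here is a self-contained sketch of the idea: each worker process hashes one whole file (computing all requested digests in a single read pass), while the pool spreads the files across cores. The helper name `hash_one`, the 64 KiB chunk size, and the temporary-file demo are illustrative assumptions, not part of the original answer:

```python
import hashlib
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor

def hash_one(path, names=('md5', 'sha1')):
    """Hash a file once per requested algorithm, reading it in a single pass."""
    hashers = {n: hashlib.new(n) for n in names}
    with open(path, 'rb') as fl:
        while chunk := fl.read(64 * 1024):
            for h in hashers.values():
                h.update(chunk)
    return {n: h.hexdigest() for n, h in hashers.items()}

if __name__ == '__main__':
    # Demo on a small temporary file; real code would pass the actual file list.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b'hello world')
    files = [tmp.name]
    with ProcessPoolExecutor(max_workers=2) as executor:
        for path, digests in zip(files, executor.map(hash_one, files)):
            print(path, digests['md5'], digests['sha1'])
    os.unlink(tmp.name)
```

Note that this parallelizes across files, not across hash algorithms within one file: each file is still read only once, and the per-file work is what gets distributed over the cores.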
If you want to see my benchmarks, here are the results:
Total using read_file: 10.121980099997018
Total using hash_md5_sha: 40.49621040000693
Total (multi-thread) using read_file: 6.246223400000417
Total (multi-thread) using hash_md5_sha: 19.588415799999893
Total (multi-core) using read_file: 4.099713300000076
Total (multi-core) using hash_md5_sha: 14.448464199999762
I used 40 files of 300 MiB each for the tests.
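A timing harness producing lines in the format above could look roughly like this (the `bench` helper and its parameters are assumptions for illustration; the answerer's actual harness is not shown):

```python
import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def bench(label, executor_cls, fn, files, workers=4):
    """Time how long it takes to run fn over every file with a given executor."""
    start = time.perf_counter()
    with executor_cls(workers) as ex:
        list(ex.map(fn, files))  # drain the iterator so all tasks finish
    elapsed = time.perf_counter() - start
    print(f'Total ({label}) using {fn.__name__}: {elapsed}')
    return elapsed
```

It would be called with the question's functions, e.g. `bench('multi-core', ProcessPoolExecutor, hash_md5_sha, files)` or `bench('multi-thread', ThreadPoolExecutor, read_file, files)`.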