python - Pandas pd.concat() in separate threads shows no speed-up
Problem description
I am trying to use pandas in a multi-threaded environment. I have a few long lists of pandas DataFrames (5000 frames per list, each 300x2500) which I need to concatenate. Since I have multiple lists, I want to run the concat for each list in its own thread (or use a thread pool, to get at least some parallel processing).
For some reason the processing time in my multi-threaded setup is identical to single-threaded processing. I am wondering if I am doing something systematically wrong.
Here is my code snippet. I use ThreadPoolExecutor to implement the parallelization:
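import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

def func_merge(the_list, key):
    # Concatenate one list of frames and return it together with its key
    return (key, pd.concat(the_list))

def my_thread_starter():
    buffer = {
        'A': [df_1, ..., df_5000],
        'B': [df_a1, ..., df_a5000]
    }
    with ThreadPoolExecutor(max_workers=2) as executor:
        submitted = []
        # Submit one concat job per list
        for key, df_list in buffer.items():
            submitted.append(executor.submit(func_merge, df_list, key=key))
        # Collect results in the order the jobs finish
        for future in as_completed(submitted):
            out = future.result()
            # do with results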
Is there a trick to using pandas' concat in separate threads? I would at least expect my CPU utilization to spike when running more threads, but it does not seem to have any effect. Consequently, the time advantage is zero, too.
Does anyone have an idea what the problem could be?
Solution
Because of the Global Interpreter Lock (GIL), I'm not sure your code is actually benefiting from multi-threading. Basically, ThreadPoolExecutor is useful when the workload is I/O-bound rather than CPU-bound, like making many web API calls at the same time.
This may have changed in Python 3.8, but I don't know how to interpret the phrase "tasks which release the GIL" in the documentation.
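The difference between CPU-bound and I/O-bound work under threads can be sketched with the standard library alone. This is a minimal illustration, assuming a standard CPython build with the GIL; `cpu_task`, `io_task`, `run_sequential` and `run_threaded` are hypothetical names introduced here, and `time.sleep` only stands in for a blocking network call. It does not measure `pd.concat` itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    # Pure-Python loop: the thread holds the GIL while it computes
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_task(seconds):
    # time.sleep releases the GIL; it stands in for a blocking I/O call
    time.sleep(seconds)
    return seconds

def run_sequential(task, args):
    # Run the tasks one after another in the calling thread
    return [task(a) for a in args]

def run_threaded(task, args, workers=2):
    # Run the tasks in a thread pool; results come back in input order
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(task, args))

def timed(fn, *args):
    # Return (result, elapsed seconds) for a single call
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    jobs = [
        ("cpu-bound", cpu_task, [2_000_000, 2_000_000]),
        ("io-bound", io_task, [0.5, 0.5]),
    ]
    for label, task, args in jobs:
        _, t_seq = timed(run_sequential, task, args)
        _, t_thr = timed(run_threaded, task, args)
        print(f"{label}: sequential {t_seq:.2f}s, threaded {t_thr:.2f}s")
```

On a GIL build one would expect the threaded I/O-bound run to take roughly half the sequential time, while the CPU-bound run shows little or no gain. Exact timings depend on the machine.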
ProcessPoolExecutor could help, but it has to serialize the inputs and outputs of the function to pass them between processes, so with a huge data volume it probably won't be faster.
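To get a rough feel for that serialization cost, one can time a pickle round trip, since ProcessPoolExecutor pickles arguments and return values when moving them between processes. The sketch below is an assumption-laden stand-in: `make_table` and `pickle_round_trip` are hypothetical helpers, and a plain list of float rows replaces a real DataFrame, so the numbers only hint at the order of magnitude.

```python
import pickle
import time

def make_table(rows, cols):
    # Hypothetical stand-in for one DataFrame: a list of rows of floats
    return [[float(r * cols + c) for c in range(cols)] for r in range(rows)]

def pickle_round_trip(obj):
    # Serialize and deserialize once, mimicking one transfer to a worker
    start = time.perf_counter()
    payload = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    elapsed = time.perf_counter() - start
    return restored, len(payload), elapsed

if __name__ == "__main__":
    # One table with the 300x2500 shape mentioned in the question
    table = make_table(300, 2500)
    restored, size, elapsed = pickle_round_trip(table)
    print(f"payload: {size / 1e6:.1f} MB, round trip: {elapsed:.3f}s")
```

With 5000 such frames per list, this cost is paid for every input sent to a worker and again for the concatenated result sent back, which is the overhead the answer warns about.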