Pandas pd.concat() in separate threads shows no speed-up

Problem description

I am trying to use pandas in a multi-threaded environment. I have a few long lists of pandas frames (5000 frames each, with dimensions of 300x2500) which I need to concatenate. Since I have multiple lists, I want to run the concat for each list in its own thread (or use a thread pool, at least to get some parallel processing).

For some reason, the processing time in my multi-threaded setup is identical to single-threaded processing. I am wondering if I am doing something systematically wrong.

Here is my code snippet; I use ThreadPoolExecutor to implement the parallelization:


import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

def func_merge(the_list, key):
    # Concatenate one list of frames and tag the result with its key.
    return (key, pd.concat(the_list))

def my_thread_starter():
    buffer = {
              'A': [df_1, ..., df_5000],
              'B': [df_a1, ..., df_a5000]
              }
    with ThreadPoolExecutor(max_workers=2) as executor:
        submitted = []

        for key, df_list in buffer.items():
            submitted.append(executor.submit(func_merge, df_list, key=key))

        for future in as_completed(submitted):
            out = future.result()
            # do something with the result

Is there a trick to using Pandas' concat in separate threads? I would at least expect my CPU utilization to spike when running more threads, but it does not seem to have any effect. Consequently, the time advantage is zero, too.

Does anyone have an idea what the problem could be?

Tags: python, pandas, multithreading

Solution


Because of the Global Interpreter Lock (GIL), I'm not sure your code is actually leveraging multi-threading. Basically, ThreadPoolExecutor is useful when the workload is not CPU-bound but IO-bound, like making many Web API calls at the same time.
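As a quick way to see this (a minimal, self-contained sketch; the frame sizes, counts, and worker number here are made up, not taken from the question), you can time the same concatenation work sequentially and through a thread pool. On CPython the two timings typically come out close, because pd.concat holds the GIL for most of its work:

import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

# Two lists of synthetic frames (sizes are arbitrary, just for timing).
frames = [pd.DataFrame(np.random.rand(300, 25)) for _ in range(500)]
buffer = {'A': frames, 'B': [df.copy() for df in frames]}

# Sequential baseline: concatenate each list one after the other.
start = time.perf_counter()
for df_list in buffer.values():
    pd.concat(df_list)
print('sequential:', time.perf_counter() - start)

# Same work on two threads: expect roughly the same wall time,
# because the GIL lets only one thread run Python bytecode at a time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as executor:
    list(executor.map(pd.concat, buffer.values()))
print('threaded:', time.perf_counter() - start)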

This may have changed in Python 3.8, but I don't know how to interpret the phrase "tasks which release the GIL" in the documentation.

ProcessPoolExecutor could help, but because it has to serialize the input and output of the function, with such a huge data volume it may not end up being faster.
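For completeness, here is a minimal sketch of the same merge on processes instead of threads (the synthetic frames and the my_process_starter helper are made up for illustration). Whether it wins depends on whether the concat work outweighs the cost of pickling every frame to the workers and the merged result back:

from concurrent.futures import ProcessPoolExecutor, as_completed

import numpy as np
import pandas as pd

def func_merge(the_list, key):
    # Runs in a worker process; the_list is pickled on the way in,
    # and the concatenated frame is pickled on the way back.
    return key, pd.concat(the_list)

def my_process_starter(buffer):
    results = {}
    with ProcessPoolExecutor(max_workers=2) as executor:
        submitted = [executor.submit(func_merge, df_list, key)
                     for key, df_list in buffer.items()]
        for future in as_completed(submitted):
            key, merged = future.result()
            results[key] = merged
    return results

if __name__ == '__main__':
    # The __main__ guard is required: worker processes re-import this module.
    buffer = {
        'A': [pd.DataFrame(np.random.rand(300, 25)) for _ in range(500)],
        'B': [pd.DataFrame(np.random.rand(300, 25)) for _ in range(500)],
    }
    results = my_process_starter(buffer)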
