python - Proper multiprocessing syntax to apply a function to each row of pandas dataframe instead of using chunks
Problem description
I have a dataframe with two columns, where each cell holds a list of index labels. I want to take the product of the two lists row by row and combine the results into a single MultiIndex.
For example, df1 below
|---------------------|------------------|
| col_a | col_b |
|---------------------|------------------|
| [A1, A2] | [B1] |
|---------------------|------------------|
| [A3] | [B2, B3] |
|---------------------|------------------|
would turn into this:
MultiIndex([('A1', 'B1'),
('A2', 'B1'),
('A3', 'B2'),
('A3', 'B3')],
names = ['col_a', 'col_b'], length = 4)
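For reference, the small example above can be reproduced serially as follows. This is only a minimal sketch, assuming the cells hold plain Python lists of strings; the helper name `row_products` is hypothetical and not part of the original code.

```python
import itertools

import pandas as pd

# Toy version of df1 from the example above (cells are plain Python lists).
df1 = pd.DataFrame({
    "col_a": [["A1", "A2"], ["A3"]],
    "col_b": [["B1"], ["B2", "B3"]],
})


def row_products(df):
    """Yield every (a, b) pair from the per-row product of the two list columns."""
    for a_list, b_list in zip(df["col_a"], df["col_b"]):
        yield from itertools.product(a_list, b_list)


mult_ind = pd.MultiIndex.from_tuples(list(row_products(df1)), names=list(df1.columns))
print(mult_ind)
```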
The size of df1 is about 50K rows, the average length of the list in col_a is 300, and the average length of the list in col_b is 30. Since this is a fairly hefty task, I have decided to go down the multiprocessing route and do the following:
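import itertools
import numpy as np
import pandas as pd
from multiprocessing import Pool

def worker(x):
    # Per-row product of the two lists; .sum() concatenates the per-row lists
    return x.apply(lambda row: list(itertools.product(*row)), axis=1).sum()

if __name__ == '__main__':
    # df1 is the dataframe described above
    num_processes = 56
    data_split = np.array_split(df1, num_processes)
    with Pool(processes=num_processes) as p:
        output = p.map(worker, data_split)
    output_tup = tuple(itertools.chain(*output))
    mult_ind = pd.MultiIndex.from_tuples(output_tup, names=list(df1.columns))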
While this works much better than a regular apply, I want to move away from chunking and instead assign a single row to each of the 56 processes, have the worker function run on the row in each of the processes, pickle the output, and then assign a new row to whichever process is open and repeat this until all rows are complete.
I'm thinking that I need to use map_async or imap for this, but for the life of me have not been able to get the syntax to work for this. Is it possible to do what I am asking?
Solution
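One way to get the per-row dispatch described in the question is `Pool.imap` with `chunksize=1`: the pool hands one row at a time to whichever process is free, pickles each result back, and yields the results in input order. The following is a minimal sketch rather than a definitive answer. It assumes the columns are named `col_a` and `col_b` and hold plain Python lists; the function name `build_multiindex` and its parameters are hypothetical.

```python
import itertools
from multiprocessing import Pool

import pandas as pd


def worker(row):
    """Return the product of the two lists held in one row as a list of tuples."""
    a_list, b_list = row
    return list(itertools.product(a_list, b_list))


def build_multiindex(df, num_processes=4, chunksize=1):
    # Send plain (list, list) tuples to the workers; they are cheaper to
    # pickle than pandas rows. Column names are an assumption from the question.
    rows = zip(df["col_a"], df["col_b"])
    with Pool(processes=num_processes) as pool:
        # imap gives `chunksize` rows at a time to whichever process is free
        # and yields the results in the original row order.
        results = pool.imap(worker, rows, chunksize=chunksize)
        # Consume the results before the pool is closed by the `with` block.
        tuples = list(itertools.chain.from_iterable(results))
    return pd.MultiIndex.from_tuples(tuples, names=["col_a", "col_b"])


if __name__ == "__main__":
    df1 = pd.DataFrame({
        "col_a": [["A1", "A2"], ["A3"]],
        "col_b": [["B1"], ["B2", "B3"]],
    })
    print(build_multiindex(df1, num_processes=2))
```

If the order of the tuples does not matter, `pool.imap_unordered` takes the same arguments and yields each result as soon as it is ready. Note that dispatching one row at a time adds inter-process overhead for every row, so it may not beat the chunked version; with the sizes given in the question, the result is on the order of hundreds of millions of tuples, and pickling them back to the parent is likely to be a large part of the cost. Trying a larger `chunksize` is a reasonable experiment.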