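python - Writing a Pandas DataFrame to Excel worksheets with multiprocessing
Problem description
I have working code that writes a large DataFrame to separate sheets of an Excel file, but it takes a long time, about 30-40 minutes. I would like to make it run faster using multiprocessing.
I tried rewriting it with multiprocessing so that each Excel sheet is written in parallel on multiple processors. The modified code runs without errors, but it does not write the Excel file correctly. Any advice would be helpful.
Original working part of the code: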
import os
from excel_writer import append_df_to_excel
import pandas as pd

path = os.path.dirname(
    os.path.abspath(__file__)) + '\\fund_data.xlsx'  # get path to current directory and excel filename for data
data_cols = df_all.columns.values.tolist()  # Create a list of the columns in the final dataframe
# print(data_cols)
for column in data_cols:  # For each column in the dataframe
    df_col = df_all[column].unstack(level = -1)  # unstack so Dates are across the top oldest to newest
    df_col = df_col[df_col.columns[::-1]]  # reorder for dates are newest to oldest
    # print(df_col)
    append_df_to_excel(path, df_col, sheet_name = column, truncate_sheet = True,
                       startrow = 0)  # Add data to excel file
Modified code attempting multiprocessing:
import os
from excel_writer import append_df_to_excel
import pandas as pd
import multiprocessing


def data_to_excel(col, excel_fn, data):
    data_fr = pd.DataFrame(data)  # switch list back to dataframe for putting into excel file sheets
    append_df_to_excel(excel_fn, data_fr, sheet_name = col, truncate_sheet = True, startrow = 0)  # Add data to sheet in excel file


if __name__ == "__main__":
    path = os.path.dirname(
        os.path.abspath(__file__)) + '\\fund_data.xlsx'  # get path to current directory and excel filename for data
    data_cols = df_all.columns.values.tolist()  # Create a list of the columns in the final dataframe
    # print(data_cols)
    pool = multiprocessing.Pool(processes = multiprocessing.cpu_count())
    for column in data_cols:  # For each column in the dataframe
        df_col = df_all[column].unstack(level = -1)  # unstack so Dates are across the top oldest to newest
        df_col = df_col[df_col.columns[::-1]]  # reorder for dates are newest to oldest
        # print(df_col)
        data_col = df_col.values.tolist()  # convert dataframe column to a list to use in pool
        pool.apply_async(data_to_excel, args = (column, path, data_col))
    pool.close()
    pool.join()
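Solution
I don't know of a correct way to write to a single file from multiple processes. I had to solve a similar problem, and I did it by creating a dedicated writer process that receives its data through a Queue. You can see my solution here (sorry, it is undocumented).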
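As a side note on why the attempt in the question can appear to "run without errors": an exception raised inside a task submitted with `pool.apply_async` is stored on the returned `AsyncResult` and only re-raised in the parent when `.get()` is called. The sketch below illustrates that behaviour; `write_sheet` is a hypothetical stand-in for a worker whose Excel write fails, not the asker's `data_to_excel`.

```python
import multiprocessing


def write_sheet(name):
    """Hypothetical worker that always fails, standing in for a broken Excel write."""
    raise RuntimeError(f"could not write sheet {name!r}")


def collect_errors(names):
    """Submit one task per name and return the error message of every failed task."""
    errors = {}
    with multiprocessing.Pool(processes=2) as pool:
        pending = {name: pool.apply_async(write_sheet, args=(name,)) for name in names}
        for name, result in pending.items():
            try:
                result.get()  # re-raises the worker's exception in the parent process
            except Exception as exc:
                errors[name] = str(exc)
    return errors


if __name__ == "__main__":
    print(collect_errors(["NAV", "Returns"]))
```

Keeping the `AsyncResult` objects and calling `.get()` on each one is a cheap way to check whether the workers are failing silently before changing the overall design.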
Simplified version (draft)
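import logging
import queue
from multiprocessing import Process, Queue


def do_calculation(in_queue, out_queue, calculate_function):
    # Worker: process jobs until the input queue is empty.
    try:
        while True:
            data = in_queue.get(False)  # non-blocking; raises queue.Empty when drained
            try:
                res = calculate_function(**data)
                out_queue.put(res)
            except ValueError:
                out_queue.put("fail")
                logging.error(f"fail on {data}")
    except queue.Empty:
        return


def save_process(out_queue, file_path, count):
    # Writer: the only process that touches the Excel file.
    for i in range(count):
        data = out_queue.get()
        if isinstance(data, str) and data == "fail":
            continue
        # write to excel here


if __name__ == "__main__":
    # calculate_function, process_num, path_to_excel and data_size are
    # placeholders that must be defined for your own data.
    input_queue = Queue()
    res_queue = Queue()
    process_list = []
    # put data in input queue
    for i in range(process_num):
        p = Process(target=do_calculation, args=(input_queue, res_queue, calculate_function))
        p.start()
        process_list.append(p)
    p2 = Process(target=save_process, args=(res_queue, path_to_excel, data_size))
    p2.start()
    p2.join()  # the writer finishes once it has received data_size results
    for p in process_list:
        p.join()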
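A simpler variant of the same idea, offered only as a sketch: let the worker processes do the reshaping and have the parent process write every sheet through one `pd.ExcelWriter`, so the workbook is opened and saved once. It assumes an Excel engine such as openpyxl is installed; `reshape_column`, `write_workbook` and the demo data are illustrative names, not part of the original code. If most of the 30-40 minutes goes into writing the file rather than reshaping, the gain will come mainly from not reopening the workbook for each sheet, not from the pool.

```python
import multiprocessing

import pandas as pd


def reshape_column(item):
    """Worker: reshape one column in a child process; it never touches the file."""
    name, series = item
    wide = series.unstack(level=-1)        # innermost index level (dates) becomes columns
    return name, wide[wide.columns[::-1]]  # reverse the column order


def write_workbook(df_all, path, processes=2):
    """Reshape columns in parallel, then write all sheets from this single process."""
    items = [(col, df_all[col]) for col in df_all.columns]
    with multiprocessing.Pool(processes=processes) as pool:
        results = pool.map(reshape_column, items)
    with pd.ExcelWriter(path) as writer:   # one writer, one save
        for name, frame in results:
            frame.to_excel(writer, sheet_name=str(name))


if __name__ == "__main__":
    # Tiny made-up frame with a (fund, date) MultiIndex, for illustration only.
    index = pd.MultiIndex.from_product(
        [["FundA", "FundB"], ["2020-01", "2020-02"]], names=["fund", "date"]
    )
    df_all = pd.DataFrame(
        {"price": [1.0, 2.0, 3.0, 4.0], "volume": [10, 20, 30, 40]}, index=index
    )
    write_workbook(df_all, "fund_data_demo.xlsx")
```

Because only the parent holds the file open, there is no concurrent access to the workbook, which is the situation the writer process above is designed to avoid.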