Crawling data faster in python

Problem description

I'm crawling through 25 GB of data stored in bz2 files. Right now I process each compressed file in turn: open it, extract the sensor data, compute the median, and once all the files are done, write the results to an Excel file. It takes a full day to process those files, which is not bearable.

I want to make the process faster by firing off as many threads as possible, but how would I approach that problem? Pseudo code of the idea would be good.

The complication I'm thinking of is that each day's compressed file comes with timestamps. For example, I have day 1 at 20:00: I need to process its file and save the result in a list while other threads process other days, but I need the data to stay in sequence in the file written to disk.

Basically, I want to speed the whole thing up.

Here is pseudo code of the file-processing routine referred to in the answer:

import pandas as pd

def proc_file(directoary_names):
    try:
        # process each day's file in order
        for idx, name in enumerate(directoary_names):
            print(name)
            process_data(name, idx, directoary_names)
    except KeyboardInterrupt:
        pass

    print("writing data")
    # collect the per-day medians (filled in by process_data) into one DataFrame
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()

Sx and Tx are the sensor (strain and temperature) values.

Tags: python, multithreading

Solution


With multiprocessing, you seem to have a pretty straightforward task.

from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()  # shared list that every worker process can append to

def proc_file(file):
    # Process it and compute the median here
    l.append(median)

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

# somehow save l to excel.
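
For the "somehow save l to excel" part, a minimal sketch (reusing the output path from the question and assuming each worker appended a single median value) could be:

import pandas as pd

s = pd.Series(list(l), name='median')  # copy the manager's proxy list into a plain Series
writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
s.to_excel(writer, sheet_name='Sheet1')
writer.save()

One caveat: items land in l in whatever order the workers happen to finish, so if the rows must stay in timestamp order, either append (timestamp, median) pairs and sort them afterwards, or use the return value of p.map, which comes back in the same order as the input list.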

Update: since you want to keep the file names, possibly as a pandas column, here is how:

from multiprocessing import Pool, Manager
import pandas as pd

manager = Manager()
d = manager.dict()  # shared dict: file name -> result

def proc_file(file):
    # Process it
    d[file] = median  # assuming file is given as a string; if your median (or whatever you want) is a list, this works as well

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(dict(d))  # copy the proxy dict into a plain dict before handing it to pandas
# if your 'median' is a list:
# s = pd.DataFrame(dict(d)).T
writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save()  # write out the Excel file

If each of your files produces more than one value, you can create a dict in which each entry is a list holding those values.
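
A sketch of that multi-value variant, assuming a fork-style process start (as in the snippets above), with hypothetical file names and the sensor column names borrowed from the question:

from multiprocessing import Pool, Manager
import pandas as pd

manager = Manager()
d = manager.dict()  # file name -> list of per-sensor medians

def proc_file(file):
    # placeholder: in practice, open the bz2 file here and compute one median per sensor
    d[file] = [1.0, 2.0, 3.0]

your_file_list = ['day1.bz2', 'day2.bz2']  # hypothetical file names
p = Pool(4)
p.map(proc_file, your_file_list)

# one column per file, transposed so that each file becomes a row
df = pd.DataFrame(dict(d)).T
df.columns = ['S_strain_HOY', 'S_strain_HMY', 'S_temp_HOY']  # names borrowed from the question
df = df.sort_index()  # put the rows back in day order
df.to_excel(r'c:\ahmed\median_data_meter_12.xlsx', sheet_name='Sheet1')

Because the dict is keyed by file name, sorting the index restores the rows to day order, provided the file names themselves sort chronologically.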

