python - Crawling data faster in python
Problem description
I'm processing 25 GB of bz2 files. Right now I open each archive, extract the sensor data, compute the median, and once all the files are processed, write the results to an Excel file. It takes a full day to process those files, which is not bearable.
I want to make the process faster by firing off as many threads as possible, but how would I approach that problem? Pseudo code of the idea would be good.
The complication I'm thinking of is that I have timestamps for each day's zip file. So for example I have day 1 at 20:00: I need to process its file and save the result in a list while other threads process other days, but I need to keep the data in sequence in the file written to disk.
Basically, I want to speed it up.
Here is pseudo code of the file-processing step, as referenced by the answer:
def proc_file(directory_names):
    i = 0
    try:
        for idx in range(len(directory_names)):
            print(directory_names[idx])
            process_data(directory_names[idx], i, directory_names)
            i = i + 1
    except KeyboardInterrupt:
        pass
    print("writing data")
    general_pd['TimeStamp'] = timeStamps
    general_pd['S_strain_HOY'] = pd.Series(S1)
    general_pd['S_strain_HMY'] = pd.Series(S2)
    general_pd['S_strain_HUY'] = pd.Series(S3)
    general_pd['S_strain_ROX'] = pd.Series(S4)
    general_pd['S_strain_LOX'] = pd.Series(S5)
    general_pd['S_strain_LMX'] = pd.Series(S6)
    general_pd['S_strain_LUX'] = pd.Series(S7)
    general_pd['S_strain_VOY'] = pd.Series(S8)
    general_pd['S_temp_HOY'] = pd.Series(T1)
    general_pd['S_temp_HMY'] = pd.Series(T2)
    general_pd['S_temp_HUY'] = pd.Series(T3)
    general_pd['S_temp_LOX'] = pd.Series(T4)
    general_pd['S_temp_LMX'] = pd.Series(T5)
    general_pd['S_temp_LUX'] = pd.Series(T6)
    writer = pd.ExcelWriter(r'c:\ahmed\median_data_meter_12.xlsx', engine='xlsxwriter')
    # Convert the dataframe to an XlsxWriter Excel object.
    general_pd.to_excel(writer, sheet_name='Sheet1')
    # Close the Pandas Excel writer and output the Excel file.
    writer.save()
S1 to S8 and T1 to T6 are sensor values.
Solution
Using multiprocessing, you seem to have a fairly simple task.
from multiprocessing import Pool, Manager

manager = Manager()
l = manager.list()

def proc_file(file):
    # Process it
    l.append(median)

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)
# somehow save l to excel.
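A runnable sketch of this pattern, with a hypothetical `proc_file` standing in for the real bz2 parsing (the file names and "sensor readings" below are made up). Note that `Pool.map` returns results in the same order as the input list, so the ordering concern from the question is handled without any extra synchronization:

```python
from multiprocessing import Pool

def proc_file(path):
    # Hypothetical stand-in for the real bz2 parsing: derive
    # fake "sensor readings" from the file name, then take the median.
    values = [len(path), len(path) * 2]   # pretend sensor readings
    values.sort()
    return values[len(values) // 2]       # median of the readings

if __name__ == '__main__':
    file_list = ['day1.bz2', 'day2.bz2', 'day3.bz2']
    with Pool(4) as p:
        # map() preserves input order, so medians line up with file_list
        medians = p.map(proc_file, file_list)
    print(medians)  # one median per file, in the order of file_list
```

Because each worker returns its result instead of appending to a shared list, no `Manager` is needed here; the returned list is already in day order and can be written to disk sequentially.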
Update: since you want to keep the file names, possibly as a pandas column, here is how:
from multiprocessing import Pool, Manager

manager = Manager()
d = manager.dict()

def proc_file(file):
    # Process it
    d[file] = median  # assuming file is given as a string; if your median
                      # (or whatever you want) is a list, this works as well

p = Pool(4)  # however many processes you want to spawn
p.map(proc_file, your_file_list)

s = pd.Series(d)
# if your 'median' is a list:
# s = pd.DataFrame(d).T
writer = pd.ExcelWriter(path)
s.to_excel(writer, 'sheet1')
writer.save()  # to excel format.
If each of your files yields multiple values, you can make each dictionary entry a list containing those values.
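A minimal sketch of that dict-of-lists step, with made-up file names, sensor columns, and medians (only the first three sensor names from the question are used here for brevity):

```python
import pandas as pd

# Hypothetical output of the workers: one list of per-sensor medians per file
d = {
    'day1.bz2': [1.0, 2.0, 3.0],
    'day2.bz2': [1.5, 2.5, 3.5],
}

# Each dict key becomes a column, so transpose to get one row per file
df = pd.DataFrame(d).T
df.columns = ['S_strain_HOY', 'S_strain_HMY', 'S_strain_HUY']
print(df)
```

From here `df.to_excel(...)` writes the table with the file names as the row index, which preserves the per-day association the question asks for.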