首页 > 解决方案 > 优化 zarr 数组处理

问题描述

我有一个包含 80 个 5-D zarr 文件的列表(mylist),其结构如下(T、F、B、Az、El)。该数组的形状为 [24x4096x2016x24x8]。

我想提取切片数据并使用以下函数沿某个轴运行概率

def GetPolarData(mylist, freq, FreqLo, FreqHi):
    '''
    This function will take the list of zarr files (T, F, B, Az, El), open them, used selected frequency to return an array
    of files with Azimuth and Elevation probabilities
    '''

    ChanIndx = FreqCut(FreqLo, FreqHi,freq)
    
    if len(ChanIndx) != 0:
        MyData = []
        for i in range(len(mylist)):
            print('Adding file {} : {}'.format(i,mylist[i][32:]))
            try:
                zarrf = xr.open_zarr(mylist[i], group = 'arr')
                m = zarrf.master.sum(dim = ['time','baseline'])
                m = m[ChanIndx].sum(dim = ['frequency'])

                c = zarrf.counter.sum(dim = ['time','baseline'])
                c = c[ChanIndx].sum(dim = ['frequency'])

                p = m.astype(float)/c.astype(float)

                MyData.append(p)

            except Exception as e:
                print(e)
                continue

    else:
        print("Something went wrong in Frequency selection")
                
    print("##########################################")
    print("This will be contribution to selected band")
    print("##########################################")

    print(f"Min {np.nanmin(MyData)*100:.3f}%  ")
    print(f"Max {np.nanmax(MyData)*100:.3f}%  ")
    print(f"Average {np.nanmean(MyData)*100:.3f}%  ")
    return(MyData) 

如果我使用以下方法调用该函数,

FreqLo = 470.
FreqHi = 854.
MyTVData =np.array(GetPolarData(AllZarrList,Freq, FreqLo, FreqHi))

我发现在 40 核、256 GB RAM 上运行需要很长时间(超过 3 小时)

有没有办法让它运行得更快?

谢谢

标签: pythonruntimezarr

解决方案


推荐阅读