Is there a way to optimize this loop for a large dataset?

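Problem description

I have a dataset (Python) in which I want to classify the time differences (the Diff column) into types from -3 to 3, based on the mean and standard deviation. The dataset has almost 700,000 rows. This is how I create the array of classifications: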

def tipo_(x):
    dm3 = data.Diff.mean()-2*data.Diff.std()
    dm2 = data.Diff.mean()-1*data.Diff.std()
    dm1 = data.Diff.mean()
    d2 = data.Diff.mean()+1*data.Diff.std() 
    d3 = data.Diff.mean()+2*data.Diff.std()
    Type2 = []
    if x <= dm3:
        Type2.append(-3)
    elif x <= dm2:
        Type2.append(-2)
    elif x <= dm1:
        Type2.append(-1)
    elif x >= d3:
        Type2.append(3)
    elif x >= d2:
        Type2.append(2)
    elif x > dm1:
        Type2.append(1)
    return Type2 


Tipo2 = np.array(list(map(tipo_,data.Diff))).flatten()

Tags: python, numpy, dataframe, loops, dictionary

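Solution


You can use boolean masks to avoid iterating over each element: pass the whole DataFrame to a modified version of the function, so that the mean and standard deviation are computed once instead of on every call: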

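import numpy as np

def tipo_(x):
    # x is the whole DataFrame; work on the Diff column as a NumPy array
    x = np.array(x.Diff)
    mean = np.mean(x)
    # ddof=1 matches pandas' Series.std(), which the question's code used
    std = np.std(x, ddof=1)
    dm3 = mean-2*std
    dm2 = mean-std
    dm1 = mean
    d2 = mean+std
    d3 = mean+2*std

    # start with 1 everywhere (mean < x < mean+std), then overwrite the other bands
    out = np.ones(len(x), dtype=int)
    out[x <= dm3] = -3
    out[np.logical_and(x > dm3, x <= dm2)] = -2
    out[np.logical_and(x > dm2, x <= dm1)] = -1
    out[np.logical_and(x >= d2, x < d3)] = 2
    out[x >= d3] = 3

    return out

Tipo2 = tipo_(data)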
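
As an alternative, here is a minimal sketch using np.select, which checks its conditions in order and keeps the first match, so it can mirror the if/elif chain from the question almost line for line. The names classify_diff and sample are hypothetical and used only for illustration. The sketch assumes the sample standard deviation (ddof=1), which is the pandas default for data.Diff.std(). One difference from the question's code: NaN values fall through to the default label 1 here, whereas the original function returned an empty list for them.

```python
import numpy as np

def classify_diff(values):
    """Label each value from -3 to 3 by its distance from the mean."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std(ddof=1)  # assumption: ddof=1, the pandas Series.std() default

    # Conditions are evaluated in order and the first match wins,
    # just like the if/elif chain in the question.
    conditions = [
        x <= mean - 2 * std,
        x <= mean - std,
        x <= mean,
        x >= mean + 2 * std,
        x >= mean + std,
    ]
    labels = [-3, -2, -1, 3, 2]

    # Whatever is left lies strictly between mean and mean + std.
    return np.select(conditions, labels, default=1)

# Hypothetical sample standing in for data.Diff
sample = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
result = classify_diff(sample)
```

With the real data this would be called as classify_diff(data.Diff.to_numpy()). Whether it is faster than the mask version for 700,000 rows is worth measuring on your own data; both avoid the per-element Python call.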
