What is the most efficient way to find points within a threshold for a large dataset in Python 3?

Problem description

I am currently working with huge datasets: each is a large dataframe of x, y coordinates with 1,000,000 rows. I have many such datasets, and for each one I need to find the points that lie within a certain threshold of a given set of key points.

Here is my working code so far, but it is far too slow: finding the points near 10 key points takes roughly 5 minutes. That is acceptable for one large dataframe and a small set of key points, but with many of them the computational cost becomes prohibitive.

import pandas as pd

def isolate_cluster_center_strict(dataA, dataB, center_value, resolution=(0.01, 0.01)):
    """
    dataA, dataB : pandas Series,
        the columns of the original dataframe over which the search is performed.

    center_value : (x, y) pair giving the point to search around.

    resolution : sequence of two floats,
        the search threshold in dataA and dataB,
        default (0.01, 0.01), i.e. 1 x 10^-2.
    """
    data1 = dataA.tolist()
    data2 = dataB.tolist()
    data_1_ls = []
    index_1_ls = []
    count_1_ls = []
    data_2_ls = []
    index_2_ls = []
    count_2_ls = []
    count_1 = 0
    count_2 = 0
    for i in range(len(data1)):
        if (center_value[0] - resolution[0] < data1[i] < center_value[0] + resolution[0]
                and center_value[1] - resolution[1] < data2[i] < center_value[1] + resolution[1]):
            # cluster_center is a helper defined elsewhere in my code
            dat_1, ind_1 = cluster_center(data1, data1[i])
            count_1 = count_1 + 1
            if count_1 < 10:  # only keep up to 10 instances for each key structure
                data_1_ls.append(dat_1)
                index_1_ls.append(ind_1)
                count_1_ls.append(count_1)

            dat_2, ind_2 = cluster_center(data2, data2[i])
            count_2 = count_2 + 1
            if count_2 < 10:
                data_2_ls.append(dat_2)
                index_2_ls.append(ind_2)
                count_2_ls.append(count_2)

    df_1 = pd.DataFrame({"data1": data_1_ls,
                         "data2": data_2_ls,
                         "index": index_1_ls,
                         "count": count_1_ls})

    return df_1

Any insights and suggestions are appreciated.

Tags: python, search, dataset

Solution
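A first step (a sketch, not taken from the original post) is to replace the per-row Python loop with a vectorized boolean mask, so pandas/NumPy evaluate the threshold comparison over all 1,000,000 rows in compiled code instead of interpreted Python. The column names `x` and `y` below are an assumption for illustration.

```python
import pandas as pd

def points_within_threshold(df, center, resolution=(0.01, 0.01)):
    """Return the rows of df whose (x, y) lie within resolution of center.

    Assumes df has columns named 'x' and 'y' (hypothetical names for
    this sketch). The comparison is done on whole columns at once, so
    no Python-level loop over the million rows is needed.
    """
    mask = ((df["x"] - center[0]).abs() < resolution[0]) & \
           ((df["y"] - center[1]).abs() < resolution[1])
    return df[mask]

# Small illustrative dataframe: the first two rows fall inside the
# 0.01 box around (0, 0), the third does not.
df = pd.DataFrame({"x": [0.0, 0.005, 0.5],
                   "y": [0.0, -0.005, 0.5]})
print(points_within_threshold(df, (0.0, 0.0)))
```

The `abs(...) < r` form is equivalent to the original pair of `<`/`>` comparisons per axis, and the same mask can be reused to pull out indices via `df.index[mask]`.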


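When there are many key points, even a vectorized mask still scans the whole dataframe once per key. A spatial index avoids that: build it once, then each query touches only nearby points. The sketch below uses SciPy's `cKDTree` (an extra dependency not mentioned in the original post). Note that `query_ball_point` with the default `p=2` uses Euclidean distance; passing `p=np.inf` gives the axis-aligned box that matches the original per-axis threshold when both resolutions are equal.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical stand-in data: 1,000,000 random points and 10 key points
# in the unit square (the real data would come from the dataframe).
rng = np.random.default_rng(0)
points = rng.random((1_000_000, 2))
keys = rng.random((10, 2))

# Build the index once; this is the expensive step, paid a single time.
tree = cKDTree(points)

# p=np.inf means Chebyshev distance, i.e. an axis-aligned box of
# half-width r around each key point, matching the per-axis test in
# the original code.
neighbors = tree.query_ball_point(keys, r=0.01, p=np.inf)

# neighbors[k] holds the row indices of all points near keys[k];
# those indices can be used to slice the original dataframe.
print([len(n) for n in neighbors])
```

For repeated searches over many key-point sets against the same dataframe, this changes the cost from one full scan per key to a logarithmic-time lookup per key after a one-time build.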