Finding KNN for a larger dataset

Problem description

I am trying to find the nearest neighbors for dataset A, which consists of 25,000 rows. To do that, I am fitting a KNN model on dataset B, which consists of 13 million rows. The goal is to find the 25,000 rows of dataset B that are most similar to dataset A.

from sklearn.neighbors import NearestNeighbors

model_knn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')
model_knn.fit(B)

knn_distances, knn_indices = model_knn.kneighbors(A.values, n_neighbors=10)

When I fit B with up to 600,000 rows there are no issues:

model_knn.fit(knn_test_pd[:600000])

Beyond 600,000 rows the model does not finish fitting. There is no error, but while fitting 600,000 rows takes about 2 seconds, anything larger takes hours. The data I am fitting is already scaled.

I tried splitting the data frame and fitting the chunks. Is that a correct approach? Even then the model takes hours to fit:

import numpy as np

splited_B = np.array_split(B, 113)
model_knn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')
for df in splited_B:
    model_knn.fit(df)

What should I do to fit this big dataset with KNN? Or is there another model, similar to KNN, that can handle large datasets?

Tags: python, pandas, machine-learning, scikit-learn, knn

Solution


You can split dataset B into chunks of 600,000 rows, which gives you 22 datasets (and correspondingly 22 KNN models).

At prediction time, for each row in A, find the nearest data points in each of the 22 models; this gives you 22 candidate points. Finally, search among those candidates for the closest one.
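Below is a minimal sketch of that idea, assuming A and B are pandas DataFrames with the same, already scaled, feature columns. The helper name chunked_kneighbors and the chunk size are illustrative; it keeps the 10 nearest candidates from each chunk (rather than a single point) so that merging them still yields the exact 10 nearest neighbors over all of B.

import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

CHUNK_SIZE = 600_000  # illustrative chunk size, matching what already fit quickly

def chunked_kneighbors(A, B, n_neighbors=10, chunk_size=CHUNK_SIZE):
    all_dist, all_idx = [], []
    for start in range(0, len(B), chunk_size):
        chunk = B.iloc[start:start + chunk_size]
        # Fit one model per chunk; each model only ever sees `chunk_size` rows.
        model = NearestNeighbors(n_neighbors=n_neighbors, algorithm='kd_tree')
        model.fit(chunk.values)
        dist, idx = model.kneighbors(A.values, n_neighbors=n_neighbors)
        all_dist.append(dist)
        # Shift per-chunk indices so they refer to row positions in the full B.
        all_idx.append(idx + start)
    # Stack candidates from every chunk: shape (len(A), n_chunks * n_neighbors).
    dist = np.hstack(all_dist)
    idx = np.hstack(all_idx)
    # For each row of A, keep the n_neighbors globally closest candidates.
    order = np.argsort(dist, axis=1)[:, :n_neighbors]
    rows = np.arange(len(A))[:, None]
    return dist[rows, order], idx[rows, order]

knn_distances, knn_indices = chunked_kneighbors(A, B, n_neighbors=10)

Each chunk fits in a couple of seconds, so the total time grows roughly linearly with the number of chunks instead of blowing up on a single 13-million-row index.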

