python - finding KNN for larger Dataset
Problem description
I am trying to find the nearest neighbors for a dataset A consisting of 25,000 rows. To do that, I am fitting a KNN model on a dataset B that consists of 13 million rows. The goal is to find the 25,000 rows of dataset B that are most similar to dataset A.
from sklearn.neighbors import NearestNeighbors
model_knn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')
model_knn.fit(B)
knn_distances, knn_indices = model_knn.kneighbors(A.values, n_neighbors=10)
When I fit B on up to 600,000 rows, there are no issues:
model_knn.fit(knn_test_pd[:600000])
Beyond 600,000 rows the model does not finish fitting. There is no error, but fitting 600,000 rows takes 2 seconds, while anything beyond that takes hours. The data I'm fitting is scaled.
I tried splitting the data frame and fitting the pieces. Is that a correct approach? Even then, the model takes hours to fit:
splited_B = np.array_split(B, 113)
model_knn = NearestNeighbors(n_neighbors=10, algorithm='kd_tree')
for df in splited_B:
    model_knn.fit(df)
What should I do to fit this much data to KNN? Or is there another model similar to KNN that can accept large datasets?
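Solution
You can split dataset B into chunks of 600,000 rows, which gives you 22 datasets (and, accordingly, 22 KNN models).
At prediction time, for each row in A, find the nearest data point in each of these 22 models; this gives you 22 data points. Finally, search for the nearest point among those 22.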
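The chunk-and-merge idea above could be sketched roughly as follows. This is only an illustration, not a tested recipe for 13 million rows: the function name `chunked_kneighbors` and its parameters are hypothetical, and it generalizes the answer slightly by keeping the `n_neighbors` best candidates per chunk (rather than a single nearest point) so that the final result matches a 10-neighbor query like the one in the question.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def chunked_kneighbors(B, A, n_neighbors=10, chunk_size=600_000):
    """Hypothetical helper: nearest rows of B for each row of A, one chunk of B at a time."""
    B = np.asarray(B)
    A = np.asarray(A)
    # Running best candidates per query row; empty before the first chunk.
    best_dist = np.empty((A.shape[0], 0))
    best_idx = np.empty((A.shape[0], 0), dtype=np.int64)

    for start in range(0, B.shape[0], chunk_size):
        chunk = B[start:start + chunk_size]
        k = min(n_neighbors, chunk.shape[0])
        # A fresh model per chunk: calling fit() again on one model
        # replaces the previously fitted data instead of adding to it.
        model = NearestNeighbors(n_neighbors=k, algorithm="kd_tree").fit(chunk)
        dist, idx = model.kneighbors(A)

        # Shift chunk-local indices to row positions in the full B,
        # then keep the overall n_neighbors smallest distances so far.
        all_dist = np.hstack([best_dist, dist])
        all_idx = np.hstack([best_idx, idx + start])
        order = np.argsort(all_dist, axis=1)[:, :n_neighbors]
        best_dist = np.take_along_axis(all_dist, order, axis=1)
        best_idx = np.take_along_axis(all_idx, order, axis=1)

    return best_dist, best_idx
```

With the question's data this might be called as `knn_distances, knn_indices = chunked_kneighbors(B.values, A.values)`, assuming both are pandas data frames. Because each model is discarded after it is queried, only one chunk's tree is held in memory at a time. Note that the loop in the question reuses a single model, so after it finishes only the last chunk remains fitted.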