How can I cluster high-density data with DBSCAN using good parameter settings?

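Problem description

I want to use DBSCAN to cluster some stars by their given positions (X, Y, Z). I don't know how to tune it to get the right number of clusters, or how to plot the result afterwards.

This is what the data looks like: [link]. What are the correct parameters for this data?

The number of rows is 1.202672e+06 (about 1.2 million).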

import pandas as pd
from sklearn.cluster import DBSCAN

data = pd.read_csv('datasets/full_dataset.csv')
clusters = DBSCAN(eps=0.5, min_samples=40, metric="euclidean", algorithm="auto")

Tags: python, cluster-analysis, data-science, dbscan

Solution


min_samples is arguably one of the tougher parameters to choose, but you can settle on a value by looking at the results and deciding how much noise you are okay with.
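To make "how much noise you are okay with" concrete, here is a minimal sketch that measures the share of points DBSCAN labels as noise for a few min_samples values. The helper name noise_fraction and the synthetic points are assumptions for illustration; they stand in for the real (X, Y, Z) columns, which are not shown in the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def noise_fraction(points, eps, min_samples):
    """Fit DBSCAN and return the share of points labelled as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    return float(np.mean(labels == -1))


# Synthetic stand-in for the real positions: two dense blobs plus a sparse
# uniform background. All numbers here are illustrative assumptions.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(200, 3)),
    rng.normal(loc=5.0, scale=0.3, size=(200, 3)),
    rng.uniform(low=-3.0, high=8.0, size=(100, 3)),
])

# With eps held fixed, a larger min_samples leaves fewer core points,
# so the noise share can only stay the same or grow.
fractions = {
    m: noise_fraction(points, eps=0.5, min_samples=m) for m in (5, 20, 40, 80)
}
for m, frac in fractions.items():
    print(f"min_samples={m:3d}  noise fraction={frac:.2f}")
```

Reading the printed table from top to bottom shows where the noise share starts to exceed what you are willing to accept.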

Choosing eps can be aided by running k-NN to understand the density distribution of your data. I believe the DBSCAN paper discusses this in more detail. There might even be a way to plot this in Python (in R it is kNNdistplot).
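In Python, the k-NN distances can be computed with scikit-learn's NearestNeighbors. The sketch below is one way to do it, not a port of kNNdistplot; the function name sorted_k_distances and the synthetic points are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def sorted_k_distances(points, k):
    """Return every point's distance to its k-th nearest neighbour, ascending."""
    # Ask for k + 1 neighbours because each query point comes back as its
    # own nearest neighbour at distance 0.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    distances, _ = nn.kneighbors(points)
    return np.sort(distances[:, -1])


# Synthetic stand-in for the real positions (illustrative assumption):
# one dense blob plus a sparse uniform background.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(200, 3)),
    rng.uniform(low=-3.0, high=8.0, size=(100, 3)),
])

k_dist = sorted_k_distances(points, k=4)
print("smallest, median, largest k-distance:",
      k_dist[0], k_dist[len(k_dist) // 2], k_dist[-1])
```

Plotting k_dist against its index (for example with matplotlib's plt.plot(k_dist)) gives a sorted k-distance curve; the distance at the knee of that curve is a candidate for eps. Since scikit-learn counts the point itself in min_samples, k = min_samples - 1 is the matching choice.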

I would prefer to use OPTICS, which essentially handles all eps values simultaneously. However, I haven't found a decent implementation of it in either Python or R. In fact, there is an incorrect implementation in Python that doesn't follow the original OPTICS paper at all.

If you really want to use OPTICS, I recommend the Java implementation available in ELKI.

If anyone else has heard of a proper Python implementation, I'd love to hear about it.
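For what it's worth, recent scikit-learn releases (0.21 and later, as far as I know) ship sklearn.cluster.OPTICS together with cluster_optics_dbscan. Whether it counts as "proper" by the standard above is not something this sketch checks. It only shows how one OPTICS fit can be reused to pull DBSCAN-like labels for several eps values; the synthetic points and the eps values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Synthetic stand-in for the real positions: two well-separated blobs.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(150, 3)),
    rng.normal(loc=5.0, scale=0.3, size=(150, 3)),
])

# One fit computes the cluster ordering and reachability distances.
optics = OPTICS(min_samples=10).fit(points)

# Reachability in cluster order; valleys in this sequence indicate clusters.
reachability = optics.reachability_[optics.ordering_]

# Extract DBSCAN-like labels for several eps values without refitting.
summary = {}
for eps in (0.3, 0.6, 1.0):
    labels = cluster_optics_dbscan(
        reachability=optics.reachability_,
        core_distances=optics.core_distances_,
        ordering=optics.ordering_,
        eps=eps,
    )
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    summary[eps] = (n_clusters, int(np.sum(labels == -1)))
    print(f"eps={eps}: {summary[eps][0]} clusters, {summary[eps][1]} noise points")
```

On the full 1.2 million rows this will likely be slow and memory-hungry, so trying it on a sample first seems prudent.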

If you want to go the trial-and-error route, start with a much smaller eps and go from there.
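The trial-and-error route could look roughly like the sketch below: sweep eps upward from a small value and watch the cluster count and the amount of noise. The helper name summarize, the eps grid, min_samples=10 and the synthetic points are all assumptions for illustration, not recommended settings for the star data.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def summarize(points, eps, min_samples):
    """Return (number of clusters, number of noise points) for one DBSCAN run."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    return n_clusters, int(np.sum(labels == -1))


# Synthetic stand-in for the real positions: three blobs of equal spread.
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(200, 3)),
    rng.normal(loc=4.0, scale=0.3, size=(200, 3)),
    rng.normal(loc=8.0, scale=0.3, size=(200, 3)),
])

# Start small and grow eps; with min_samples fixed, the noise count can
# only stay the same or shrink as eps increases.
results = {}
for eps in (0.05, 0.1, 0.2, 0.4, 0.8):
    results[eps] = summarize(points, eps, min_samples=10)
    print(f"eps={eps:<4}  clusters={results[eps][0]:2d}  noise={results[eps][1]}")
```

A range of eps values over which the cluster count stays stable is usually a reasonable place to stop.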

