1 Answer


There is no general way of choosing minPts. It depends on the context of the problem and what you are looking for. As with other unsupervised learning problems, the results can be completely wrong even if you choose seemingly optimal values for the hyperparameters. That said, a few useful points are worth mentioning:

 

Rule-of-thumb values for minPts and eps:

Rule-of-thumb values are discussed on this page.

minPts:

A minPts that is too low will create clusters out of outliers or noise: the lower minPts, the more spurious clusters get built from noise, so don't choose too small a value for it. minPts is best set by a domain expert who understands the data well. Unfortunately, in many cases we don't have that domain knowledge, especially after the data has been normalized. One heuristic is to use $\ln(n)$, where $n$ is the total number of points to be clustered (see the sketch below).
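As a rough illustration, here is a minimal sketch of that heuristic in Python; the lower bound of 4 is an arbitrary safeguard of mine, not part of the heuristic itself:

```python
import math

def minpts_heuristic(n, floor=4):
    # ln(n) rule of thumb; the floor guards against tiny datasets (arbitrary choice).
    return max(floor, round(math.log(n)))

print(minpts_heuristic(10_000))  # ln(10000) is about 9.2, so this prints 9
```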

epsilon (eps):

For epsilon, there are several considerations. It again boils down to choosing whatever works for this data set, this minPts, this distance function, and this normalization. You can draw a k-NN distance plot (k-distance plot) for your dataset and choose a "knee" there, but there may be no visible knee, or several.

Basically, you compute the k-nearest neighbors (k-NN) of each data point to understand the density distribution of your data for different values of k; k-NN is handy here because it is a non-parametric method. Once you choose minPts (which, again, strongly depends on your data), you fix k to that value and take as epsilon the k-distance corresponding to the region of the k-distance plot (for your fixed k) with a low slope.

The proposed method consists of computing the k-nearest-neighbor distances over the matrix of points: calculate the average distance of every point to its k nearest neighbors, where the value of $k$ is specified by the user and corresponds to minPts. Next, plot these k-distances in ascending order. The aim is to determine the "knee", which corresponds to the optimal eps parameter; a knee is a threshold where a sharp change occurs along the k-distance curve. In the example on the page cited above, the optimal eps value is around a distance of 0.15. A sketch of this procedure is given below.
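Here is a minimal, hedged sketch of that procedure in Python with scikit-learn. Note two assumptions of mine: it uses the distance to the k-th nearest neighbor rather than the average over all k neighbors (a common variant of the same idea), and the toy dataset, the value of k, and the final eps are placeholders; on real data you would read eps off the knee of your own plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # placeholder toy data

k = 9  # set k = minPts, e.g. from the ln(n) heuristic sketched above

# Ask for k+1 neighbors because each point counts itself as its 0-th neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)

# Distance from each point to its k-th nearest neighbor, sorted ascending.
k_distances = np.sort(distances[:, k])

plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.title("k-distance plot: read eps off the knee")
plt.show()

# Suppose the knee sits around 0.9 for this toy data (eyeballed, not computed):
labels = DBSCAN(eps=0.9, min_samples=k).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```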

 

OPTICS and other extensions

Some extensions have been built on top of DBSCAN, such as OPTICS. OPTICS produces hierarchical clusters, and we can extract significant flat clusters from the hierarchy by visual inspection; an OPTICS implementation is available in the Python module pyclustering. One of the original authors of DBSCAN and OPTICS also proposed an automatic way to extract flat clusters, where no human intervention is required; for more information you can read this paper. There are also other popular extensions, such as HDBSCAN, which can be found here. A short OPTICS sketch follows below.
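As a hedged illustration, here is a minimal OPTICS run using scikit-learn's implementation (chosen here instead of the pyclustering module mentioned above purely for brevity); the toy data and parameter values are my own assumptions:

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # placeholder toy data

# min_samples plays the role of minPts; OPTICS effectively sweeps over eps,
# so there is no single eps value to tune up front.
clust = OPTICS(min_samples=9).fit(X)

# labels_ holds a flat clustering extracted automatically from the hierarchy
# (scikit-learn's default "xi" method); points labeled -1 are noise.
print(set(clust.labels_))
```

If you want to try HDBSCAN instead, the hdbscan package (and, in recent scikit-learn versions, sklearn.cluster.HDBSCAN) exposes a similar fit/labels_ interface.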

At the end of the day, as with other clustering algorithms, the absence of labels makes it hard to get reliable results, and this remains one of the areas in need of improvement in the future.

...