scikit-learn,cluster-analysis,data-mining,predict,dbscan

Clustering is not classification. Clustering is unlabeled. If you want to squeeze it into a prediction mindset (which is not the best idea), then it essentially predicts without learning, because there is no labeled training data available for clustering. It has to make up new labels for the data, based...

python-2.7,cluster-analysis,hierarchical-clustering,outliers,dbscan

If finding an appropriate value of epsilon is a major problem, the real problem may lie much earlier: you may be using the wrong distance measure altogether, or you may have a preprocessing problem. Your code looks a lot like a naive preprocessing approach - and that...

cluster-analysis,data-mining,k-means,hierarchical-clustering,dbscan

Pearson correlation is not compatible with the mean. Thus, k-means must not be used - it is designed for least squares, not for correlation. Instead, just use hierarchical agglomerative clustering, which will work with Pearson correlation matrices just fine. Or DBSCAN: it also works with arbitrary distance functions. You can...
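As a minimal sketch of the hierarchical route, assuming NumPy and SciPy are available and that `1 - r` is an acceptable way to turn Pearson correlation into a dissimilarity. The toy `series` array, the average linkage, and the cut into two clusters are illustrative assumptions, not part of the original question:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy data (assumption): three rows follow a sine shape, three a cosine shape,
# each with a different scale and offset. Pearson correlation ignores both.
x = np.linspace(0.0, 6.0, 50)
shape_a, shape_b = np.sin(x), np.cos(x)
series = np.vstack([
    1.0 * shape_a + 0.0,
    2.0 * shape_a + 5.0,
    0.5 * shape_a - 3.0,
    1.0 * shape_b + 0.0,
    3.0 * shape_b + 2.0,
    0.2 * shape_b - 1.0,
])

# Row-wise Pearson correlation, turned into a dissimilarity in [0, 2].
corr = np.corrcoef(series)
dist = np.clip(1.0 - corr, 0.0, 2.0)  # clip away tiny floating-point negatives
np.fill_diagonal(dist, 0.0)

# SciPy expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(dist, checks=False)
tree = linkage(condensed, method="average")

# Cut the dendrogram into at most two flat clusters.
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Note that no cluster mean is ever computed here: the linkage works on pairwise dissimilarities only, which is why it does not run into the problem k-means has.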

cluster-analysis,hierarchical-clustering,dbscan,elki,optics-algorithm

The problem with extracting clusters from the OPTICS plot lies in the first and last elements of a cluster. Just from the plot, you cannot (to my understanding) decide whether the last element should belong to the previous cluster or not. Consider a plot like this * * * * *...

Functions in R are usually designed to pass parameters by value and not by reference. Updating the value of the variables passed in will not change them in the calling environment. Generally speaking, the R way to do this is for your function to return the updated data. If you...

python,cluster-analysis,dbscan

Your expandCluster function forgets the new neighbors. Your set update is swapped....
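The asker's code is not reproduced here, so the following is only a sketch of a textbook-style DBSCAN in plain Python, not a corrected version of that function. All names (`region_query`, `expand_cluster`, `dbscan`) are hypothetical. The part that matters for this answer is the end of `expand_cluster`: when a newly reached point is itself a core point, its neighbors have to be added to the seed list, otherwise the cluster stops growing after the first neighborhood.

```python
from math import dist

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def expand_cluster(points, labels, i, neighbors, cluster_id, eps, min_pts):
    """Grow cluster_id from core point i by following density-reachable points."""
    labels[i] = cluster_id
    seeds = list(neighbors)   # work queue of points still to process
    seen = set(neighbors)     # avoids queueing the same index twice
    k = 0
    while k < len(seeds):
        j = seeds[k]
        k += 1
        if labels[j] == NOISE:
            labels[j] = cluster_id      # former noise becomes a border point
            continue
        if labels[j] is not UNVISITED:
            continue                    # already assigned to some cluster
        labels[j] = cluster_id
        j_neighbors = region_query(points, j, eps)
        if len(j_neighbors) >= min_pts:
            # j is a core point: merge ITS neighbors into the seed list.
            for n in j_neighbors:
                if n not in seen:
                    seen.add(n)
                    seeds.append(n)


def dbscan(points, eps, min_pts):
    """Return one label per point; NOISE (-1) marks points in no cluster."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE
            continue
        expand_cluster(points, labels, i, neighbors, cluster_id, eps, min_pts)
        cluster_id += 1
    return labels


# A chain of points: it only ends up as ONE cluster if new neighbors are kept.
chain = [(float(v), 0.0) for v in range(6)] + [(20.0, 0.0)]
print(dbscan(chain, eps=1.1, min_pts=2))
```

With the chain above, each point only sees its direct neighbors, so a version that drops the newly found neighbors would split the chain or leave most of it unassigned.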

cluster-analysis,data-mining,dbscan,elki

Solution 1: Scale your data set to match your desired epsilon. In your case, scale z by 50. Solution 2: Use a weighted distance function. E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis. This may be...
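Solution 1 can be sketched in a few lines of plain Python. The helper `scale_axis` and the two sample points are hypothetical. The sketch only illustrates rescaling one axis before clustering; it makes no claim about how a particular weighted distance implementation applies its weights, so a weight of 50 there need not correspond to a scale factor of 50 here.

```python
from math import dist


def scale_axis(points, axis, factor):
    """Return a copy of points with one coordinate multiplied by factor."""
    return [
        tuple(v * factor if k == axis else v for k, v in enumerate(p))
        for p in points
    ]


# Two hypothetical points that differ only slightly in z (axis 2).
a = (0.0, 0.0, 0.0)
b = (3.0, 4.0, 0.1)

raw_distance = dist(a, b)                    # z contributes almost nothing
scaled = scale_axis([a, b], axis=2, factor=50)
scaled_distance = dist(scaled[0], scaled[1])  # z difference is now 5.0

print(raw_distance, scaled_distance)
```

After scaling, a single epsilon applies to all three axes, because a z difference of 0.1 now counts as much as a difference of 5 in x or y.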

cluster-analysis,data-mining,weka,dbscan,elki

Accessing the DBIDs of ELKI works, if you pay attention to how they are assigned. For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be...

cluster-analysis,data-mining,dbscan

Yes, if used with Euclidean distance, then it is a radius. It is not infinitely small (it does not tend to 0). It's just supposed to be small compared to the extent of the data set, but the authors could have named it "r" instead. Use the original paper to understand the...

java,cluster-analysis,dbscan,apache-commons-math

Try the version in ELKI instead. Apache Commons Math is unfortunately not very good. I moved away from commons-math because of various small issues. ELKI works much better for me. From a quick look, commons-math is still pretty dead when it comes to cluster analysis... it was last touched for...

java,cluster-analysis,dbscan,elki

ELKI has a modular architecture. If you want your own data source, look at the datasource package, and implement the DatabaseConnection (JavaDoc) interface. If you want to process MyObject objects (the class you shared above will likely have a substantial performance impact), that is not particularly hard. You need...

cluster-analysis,data-mining,dbscan,elki

If you use the -resulthandler ResultWriter and output to text, the cluster sizes will be at the top of each cluster file. The visualizer currently doesn't seem to show cluster sizes....