cluster-analysis,data-mining,gaussian,elki

The delta parameter in EM is necessary to detect convergence. Since EM uses soft assignments internally, it will continue updating the values to arbitrary digits (technically, it will eventually run out of precision, and stop). As long as you choose a small enough value, you should be fine. However, EM...
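To illustrate why a threshold is needed, here is a minimal sketch of EM on a 1-dimensional, two-component Gaussian mixture. This is not ELKI's implementation: the class name, the `fit` method, the `maxIter` cap, the variance floor, and the choice to compare the log-likelihood gain against delta are all assumptions made for the example.

```java
class EmDeltaSketch {
    /** Log-density of a 1-d Gaussian with the given mean and variance. */
    static double logPdf(double x, double mean, double var) {
        double d = x - mean;
        return -0.5 * (Math.log(2 * Math.PI * var) + d * d / var);
    }

    /**
     * Fits a two-component 1-d Gaussian mixture (hypothetical helper).
     * Returns the iteration at which the log-likelihood gain fell below
     * delta, or maxIter if that never happened.
     */
    static int fit(double[] data, double delta, int maxIter) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double x : data) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        // Deterministic initialization: one component at each extreme.
        double[] mean = {min, max};
        double[] var = {1.0, 1.0};
        double[] weight = {0.5, 0.5};
        double[][] resp = new double[data.length][2];
        double prevLogLik = Double.NEGATIVE_INFINITY;

        for (int iter = 1; iter <= maxIter; iter++) {
            // E-step: soft assignments (responsibilities) and log-likelihood.
            double logLik = 0.0;
            for (int i = 0; i < data.length; i++) {
                double p0 = weight[0] * Math.exp(logPdf(data[i], mean[0], var[0]));
                double p1 = weight[1] * Math.exp(logPdf(data[i], mean[1], var[1]));
                double sum = p0 + p1;
                resp[i][0] = p0 / sum;
                resp[i][1] = p1 / sum;
                logLik += Math.log(sum);
            }
            // Convergence test: stop once the improvement is below delta.
            if (logLik - prevLogLik < delta) {
                return iter;
            }
            prevLogLik = logLik;
            // M-step: re-estimate weight, mean and variance of each component.
            for (int k = 0; k < 2; k++) {
                double nk = 0.0, sumX = 0.0;
                for (int i = 0; i < data.length; i++) {
                    nk += resp[i][k];
                    sumX += resp[i][k] * data[i];
                }
                mean[k] = sumX / nk;
                double sumSq = 0.0;
                for (int i = 0; i < data.length; i++) {
                    double d = data[i] - mean[k];
                    sumSq += resp[i][k] * d * d;
                }
                var[k] = Math.max(sumSq / nk, 1e-6); // floor avoids a zero variance
                weight[k] = nk / data.length;
            }
        }
        return maxIter;
    }

    public static void main(String[] args) {
        double[] data = {0.1, 0.3, 0.5, 0.7, 0.9, 4.1, 4.3, 4.5, 4.7, 4.9};
        System.out.println("iterations with delta=1e-2: " + fit(data, 1e-2, 1000));
        System.out.println("iterations with delta=1e-9: " + fit(data, 1e-9, 1000));
    }
}
```

Because the soft assignments keep changing by ever smaller amounts, a tighter delta can only keep the loop running for as long or longer; it never stops earlier than a looser one on the same data and initialization.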

java,cluster-analysis,data-mining,elki,optics-algorithm

We currently do not provide ELKI on Maven. Thus, there currently is no Maven dependency. ELKI is changing quickly, and we do not provide a stable API. For example, in the next release NumberVector<? extends Number> will simplify to just NumberVector. Getting rid of this generic is nice, but it...

sparse-matrix,sparse,outliers,elki

ArrayAdapterDatabaseConnection is designed for dense data. For sparse data, it does not make much sense to first encode it into a dense array, then re-encode it into sparse vectors. Consider reading the data as sparse vectors directly to avoid overhead. The error you are seeing has a different reason, though:...

cluster-analysis,data-mining,dbscan,elki

If you use the -resulthandler ResultWriter and output to text, the cluster sizes will be at the top of each cluster file. The visualizer currently doesn't seem to show cluster sizes....

java,cluster-analysis,elki,clique

Note that CLIQUE produces overlapping clusters. Elements can be in 0 to many clusters at the same time. If you choose your parameters badly (and CLIQUE parameters seem to be really hard to choose), you will get weird results. In your case, it seems to be 11 clusters, despite your...

java,cluster-analysis,dbscan,elki

ELKI has a modular architecture. If you want your own data source, look at the datasource package, and implement the DatabaseConnection (JavaDoc) interface. If you want to process MyObject objects (the class you shared above will likely come at a substantial performance cost), that is not particularly hard. You need...

Update: missing values in the input data broke the histogram visualizer. This has been fixed, and will work in the next release (missing values will simply be ignored, though; there won't be a separate histogram bar to indicate the number of missing values. Contributions are welcome!) The class label...

Use the -time option. Do not enable -verbose, as this will usually slow down the process because of the progress logging output. You may also want to set -resulthandler DiscardResultHandler. ELKI also has a command line mode for scripting. The logging window always shows the command line parameters you have...

cluster-analysis,hierarchical-clustering,dbscan,elki,optics-algorithm

The problem with extracting clusters from the OPTICS plot is the first and last elements of a cluster. Just from the plot, you cannot (to my understanding) decide whether the last element should belong to the previous cluster or not. Consider a plot like this * * * * *...

The ResultWriter is some of the oldest code in ELKI, and needs to be rewritten. It's rather generic - it tries to figure out how to best serialize output as text. If you want some specific format, or a specific subset, the proper way is to write your own ResultHandler....

cluster-analysis,data-mining,weka,dbscan,elki

Accessing the DBIDs of ELKI works, if you pay attention to how they are assigned. For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be...

cluster-analysis,data-mining,dbscan,elki

Solution 1: Scale your data set to match your desired epsilon. In your case, scale z by 50. Solution 2: Use a weighted distance function. E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis. This may be...
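The relation between the two solutions can be sketched in plain Java. This is not ELKI code: the class and method names are made up for the example, and it assumes the weights multiply the squared coordinate differences. Under that assumption, a weight of w on an axis is equivalent to scaling that axis by sqrt(w); how WeightedEuclideanDistanceFunction applies its weights should be checked in the ELKI JavaDoc before choosing values.

```java
class WeightedDistanceSketch {
    /** Weighted Euclidean distance: sqrt(sum_i w[i] * (a[i] - b[i])^2). */
    static double weightedEuclidean(double[] a, double[] b, double[] w) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }

    /** Plain Euclidean distance. */
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns a copy of p with each coordinate multiplied by its factor. */
    static double[] scale(double[] p, double[] factor) {
        double[] out = new double[p.length];
        for (int i = 0; i < p.length; i++) {
            out[i] = p[i] * factor[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0};
        double[] b = {4.0, 6.0, 3.5};
        double[] weights = {1.0, 1.0, 50.0};
        double[] factors = {1.0, 1.0, Math.sqrt(50.0)};
        // Both lines print the same distance, up to floating-point rounding.
        System.out.println(weightedEuclidean(a, b, weights));
        System.out.println(euclidean(scale(a, factors), scale(b, factors)));
    }
}
```

Either way, epsilon is then interpreted in the rescaled space, so the third axis contributes more to each distance than the first two.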

cluster-analysis,data-mining,elki

ELKI is mostly used with numerical data. Currently, ELKI does not have a "mixed" data type, unfortunately. The ARFF parser will split your data set into multiple relations: a 1-dimensional numerical relation containing age; a LabelList relation storing sex and region; a 1-dimensional numerical relation containing salary; a LabelList relation...

cluster-analysis,data-mining,k-means,elki

5 seconds is probably too short to give you a reliable measurement. Furthermore, with a larger k the result may converge with fewer iterations. You may want to use -time to see the time needed to run the actual algorithm. With your method, parsing and writing will have a non-negligible...
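If you measure from your own Java code instead, the same idea applies: time only the algorithm, run it several times, and look at a robust statistic rather than a single run. The following is a generic sketch, not how ELKI's -time option is implemented; `TimingSketch` and `medianNanos` are hypothetical names, and the `Runnable` stands in for whatever runs the algorithm on data that has already been parsed.

```java
import java.util.Arrays;

class TimingSketch {
    /**
     * Runs the task `warmup` times untimed, then `runs` times timed,
     * and returns the median of the timed runs in nanoseconds
     * (the upper median when `runs` is even).
     */
    static long medianNanos(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // lets the JIT compile hot code before measuring
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        Arrays.sort(elapsed);
        return elapsed[runs / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload; replace with the call that runs the algorithm only.
        Runnable workload = () -> {
            double sum = 0.0;
            for (int i = 1; i <= 1_000_000; i++) {
                sum += Math.sqrt(i);
            }
            if (sum < 0) {
                System.out.println(sum); // keeps the loop from being optimized away
            }
        };
        System.out.println("median ns: " + medianNanos(workload, 3, 11));
    }
}
```

Parsing the input and writing the result stay outside the timed region, so they do not distort the comparison between different values of k.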