I found a way. You can use the seqdumper to extract the cluster mapping: ./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt Then you can use a regex to extract the mapping of the vector IDs to cluster IDs....

algorithm,graph,cluster-analysis

If the edge weights are proportional to the benefit gained by reclustering the points, then a decent heuristic is to pick the cycle with the highest total weight. I think this addresses all three of your cases: whenever you have a choice, pick the one that will do the most...

cluster-analysis,data-mining,dbscan,elki

Solution 1: Scale your data set to match your desired epsilon. In your case, scale z by 50. Solution 2: Use a weighted distance function. E.g. WeightedEuclideanDistanceFunction in ELKI, and choose your weights accordingly, e.g. -distance.weights 1,1,50 will put 50x as much weight on the third axis. This may be...
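
To see how the two solutions relate, here is a minimal Python sketch. It assumes a weighted Euclidean distance of the form sqrt(sum of w_i * (a_i - b_i)^2); whether a particular library squares its weights is something to check in its documentation. The points and the helper names are hypothetical. Under that assumption, scaling an axis by s gives the same distance as weighting it by s^2.

```python
import math

def weighted_euclidean(a, b, weights):
    # Assumed definition: sqrt(sum_i w_i * (a_i - b_i)^2)
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def scale_axes(point, factors):
    # Multiply each coordinate by its per-axis scale factor
    return tuple(f * x for f, x in zip(factors, point))

# Hypothetical 3D points whose z axis has a much smaller range than x and y
p = (1.0, 2.0, 0.10)
q = (1.5, 2.5, 0.12)

factors = (1.0, 1.0, 50.0)  # "scale z by 50"
d_scaled = math.dist(scale_axes(p, factors), scale_axes(q, factors))
d_weighted = weighted_euclidean(p, q, [f * f for f in factors])

print(d_scaled, d_weighted)  # the two values agree up to rounding
```

Either way, a single epsilon then applies to all three axes.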

stream,machine-learning,cluster-analysis

By definition, k-means cannot create overlapping clusters, as points are assigned to the closest centroid. There are other clustering methods which can produce overlapping ones, such as fuzzy k-means.
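
A small Python sketch of the difference, using hypothetical centroids and points. The hard assignment is the k-means rule; the soft memberships use the usual fuzzy c-means membership formula with fuzzifier m, which is an assumption about what "fuzzy k-means" means here.

```python
import math

def assign_to_nearest(points, centroids):
    # Hard assignment: each point gets exactly one cluster index
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def fuzzy_memberships(point, centroids, m=2.0):
    # Soft assignment: membership degrees in every cluster, summing to 1
    dists = [math.dist(point, c) for c in centroids]
    if 0.0 in dists:
        # The point sits exactly on a centroid
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 1.0), (9.0, -1.0), (4.0, 0.0)]

labels = assign_to_nearest(points, centroids)
u = fuzzy_memberships(points[2], centroids)
print(labels)  # one label per point, no overlap possible
print(u)       # the third point belongs partly to both clusters
```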

python,math,scikit-learn,cluster-analysis,data-mining

You can easily do this using spectral clustering. You can use a ready implementation such as the one in sklearn or implement it yourself. It is a rather easy algorithm. Here is a piece of code doing it in python using sklearn: import numpy as np from sklearn.cluster import...

machine-learning,cluster-analysis,k-means,hierarchical-clustering

The idea I'm suggesting originated in text processing, NLP and information retrieval, and is very widely used in situations where you have sequences of characters/information, such as genetic information. Because you have to preserve the sequence, we can use the concept of n-grams. I'm using bi-grams in the following example, though you...

You can use ddply from plyr to do this easily. library(plyr) ddply(df,.(cluster),summarise,variance1 = var(V1),variance2 = var(V2),mean1 = mean(V1),...) You can also do it this way, ddply(df,.(cluster),function(x){ res = c(as.numeric(colwise(var)(x)),as.numeric(colwise(mean)(x))) names(res) = paste0(rep(c('Var','Mean'),each = 4),rep(1:4,2)) res }) ...

Your problem is that the document term matrix dtm should be indexed as a 2D matrix with the row and col. It should be findFreqTerms(dtm[cl$cluster==1,], 50) (notice the extra comma, we leave the second parameter empty so all columns are returned)...

r,statistics,cluster-analysis,k-means

The amount of variance explained is related to the two principal components calculated to visualize your data. This has nothing to do with the type of clustering algorithm or the accuracy of the algorithm that you're using (kmeans in this case). To understand how accurate your clustering algorithm is at...

I found the answer myself. test = matrix(rnorm(200), 20, 10) test[1:10, seq(1, 10, 2)] = test[1:10, seq(1, 10, 2)] + 3 test[11:20, seq(2, 10, 2)] = test[11:20, seq(2, 10, 2)] + 2 test[15:20, seq(2, 10, 2)] = test[15:20, seq(2, 10, 2)] + 4 colnames(test) = paste("Test", 1:10, sep = "")...

You use the closest variable inside the parallel loop, both writing to it and using it as a check before increasing your moves variable. But it is declared outside the loop, so all iterations use the same variable! As all iterations are executed (theoretically) in parallel, you cannot expect that any...

cluster-analysis,data-mining,k-means,elki

5 seconds is probably too fast to get you a reliable measurement. Furthermore, with a larger k the result may converge with fewer iterations. You may want to use -time to see the time needed to run the actual algorithm. With your method, parsing and writing will have a non-negligible...

Yes, K is the number of natural groupings we believe there are in the data. You can find K by exploring the eigenvalues. One tool which is particularly designed for spectral clustering is the eigengap heuristic (also called spectral gap) - number of clusters k is usually given by the...
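
As a rough illustration, the heuristic in its commonly stated form sorts the graph Laplacian's eigenvalues in ascending order and picks k so that the first k are small and the gap to the next one is large. The sketch below assumes the eigenvalues are already computed; the values and the function name are hypothetical.

```python
def eigengap_k(eigenvalues):
    # Sort ascending, then find the largest jump between consecutive values
    vals = sorted(eigenvalues)
    gaps = [vals[i + 1] - vals[i] for i in range(len(vals) - 1)]
    # A largest gap after the k-th eigenvalue suggests k clusters
    return max(range(len(gaps)), key=gaps.__getitem__) + 1

# Hypothetical Laplacian eigenvalues: three near zero, then a clear jump
laplacian_eigenvalues = [0.0, 0.01, 0.03, 0.62, 0.70, 0.81]
k = eigengap_k(laplacian_eigenvalues)
print(k)
```

On noisy data the largest gap can be ambiguous, so it is worth looking at the sorted eigenvalues rather than trusting the argmax blindly.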

cluster-analysis,data-mining,k-means,hierarchical-clustering,dbscan

Pearson correlation is not compatible with the mean. Thus, k-means must not be used - it is proper for least-squares, but not for correlation. Instead, just use hierarchical agglomerative clustering, which will work with Pearson correlation matrices just fine. Or DBSCAN: it also works with arbitrary distance functions. You can...

algorithm,cluster-analysis,spell-checking,levenshtein-distance

Do not use clustering. It will produce a lot of false positives. It will consider “Sam” and “Pam” highly similar. Instead, look at spelling correction, or define a Levenshtein distance threshold. But something that considers typo behavior will work even better than such a naive letter approach....
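
For reference, a minimal sketch of the threshold idea in Python. The function names and the default threshold of 1 are assumptions for illustration. Note that a bare threshold still treats "Sam" and "Pam" as a match, which is the weakness of a purely letter-based approach.

```python
def levenshtein(a, b):
    # Classic dynamic programming over two rows of the edit-distance table
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_probable_typo(a, b, max_distance=1):
    # Hypothetical helper: flag distinct strings within the threshold
    return a != b and levenshtein(a, b) <= max_distance

print(levenshtein("kitten", "sitting"))
print(is_probable_typo("Sam", "Pam"))
```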

python-2.7,image-processing,cluster-analysis

The code is correct (it works fine for me). It just takes time to finish (80 seconds on my laptop). Maybe you need a grayscale image like the lena image. For the downsampling: lena = sp.misc.lena() print np.shape(lena) print np.shape(lena[::2, ::2]) # lena[0,0], lena[0,2], lena[0,4], lena[0,6] print np.shape(lena[1::2, ::2]) #...

You could fill the missing values with zeroes. The operator Replace Missing Values does this. I don't know the details of your data nor how RapidMiner calculates DTW distances so I therefore can't tell if this approach would yield valid results. Faced with this, I might use the R extension...

You could try a hierarchical clustering using a binary distance measure like jaccard, if "clicked a link" is asymmetrical: dat <- read.table(header = TRUE, row.names = 1, text = "User link1 link2 link3 link4 abc1 0 1 1 1 abc2 1 0 1 0 abc3 0 1 1 1 abc4...

scikit-learn,cluster-analysis,dirichlet

There may be some hidden numerical issues happening here. The problem is the high dimensionality of your data set. This will lead to infinitesimally small likelihoods in Gaussian mixture modeling, thus making the model very, very unlikely. At some point, you appear to get an -inf value, and then it...

machine-learning,cluster-analysis,similarity,unsupervised-learning

First, compute your profiles as you already did. Then the crucial step will be some kind of normalization. You can either divide the numbers by their total so that the numbers sum up to 1, or you can divide them by their Euclidean norm so that they have Euclidean norm...
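
The two normalizations can be sketched in a few lines of Python. The profile values and function names are hypothetical, and the code assumes a non-zero profile.

```python
import math

def normalize_l1(profile):
    # Divide by the total so the entries sum to 1
    total = sum(profile)
    return [x / total for x in profile]

def normalize_l2(profile):
    # Divide by the Euclidean norm so the vector has unit length
    norm = math.sqrt(sum(x * x for x in profile))
    return [x / norm for x in profile]

profile = [3.0, 4.0, 0.0]
p1 = normalize_l1(profile)
p2 = normalize_l2(profile)
print(p1, p2)
```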

excel,matlab,cluster-analysis,k-means,geo

I think you are looking for "path planning" rather than clustering. The traveling salesman problem comes to mind. If you want to use clustering to find the individual regions, you should find the coordinates for each location with respect to some global frame. One example would be using Latitude and...

algorithm,cluster-analysis,computational-geometry,hamming-distance

One of the previously asked questions has some good discussion, so you can refer to it: Nearest neighbors in high-dimensional data? Other than this, you can also look at http://web.cs.swarthmore.edu/~adanner/cs97/s08/papers/dahl_wootters.pdf A few papers which analyze different approaches: http://www.jmlr.org/papers/volume11/radovanovic10a/radovanovic10a.pdf https://www.cse.ust.hk/~yike/sigmod09-lsb.pdf...

sql,database,cluster-analysis,geospatial,computational-geometry

You can try a voronoi diagram. A voronoi diagram is the dual of a delaunay triangulation with some useful properties: https://alastaira.wordpress.com/2011/04/25/nearest-neighbours-voronoi-diagrams-and-finding-your-nearest-sql-server-usergroup.

r,cluster-analysis,data-visualization,phylogeny

When you say "color branches" I assume you mean color the edges. This seems to work, but I have to think there's a better way. Using the built-in mtcars dataset here, since you did not provide your data. plot.fan <- function(hc, nclus=3) { palette <- c('red','blue','green','orange','black')[1:nclus] clus <-cutree(hc,nclus) X <-...

python,scikit-learn,cluster-analysis,nearest-neighbor

The data used to fit the model are stored in neigh._fit_X: >>> neigh._fit_X array([[ 0. , 0. , 0. ], [ 0. , 0.5, 0. ], [ 1. , 1. , 0.5]]) However: The leading underscore of the variable name should be a signal to you that this is supposed...

solr,cluster-analysis,k-means,workbench,carrot

Your suspicion is correct, it is a heap size problem, or more precisely, a scalability constraint. Straight from the carrot2 FAQs: http://project.carrot2.org/faq.html#scalability How does Carrot2 clustering scale with respect to the number and length of documents? The most important characteristic of Carrot2 algorithms to keep in mind is that they...

cluster-analysis,hierarchical-clustering,dbscan,elki,optics-algorithm

The problem with extracting clusters from the OPTICS plot is the first and last elements of a cluster. Just from the plot, you cannot (to my understanding) decide whether the last element should belong to the previous cluster or not. Consider a plot like this * * * * *...

python,machine-learning,cluster-analysis

You could try some of the clustering methods, which belong to unsupervised machine learning. The result depends on your data and how they are distributed. Judging from your picture, I think the k-means algorithm could work. There is a Python library for machine learning, scikit-learn, which already contains a k-means implementation: http://scikit-learn.org/stable/modules/clustering.html#k-means

python,classification,cluster-analysis,data-mining,k-means

There are two related (but distinct technique-wise) questions here. The first relates to the choice of clustering technique for this data. The second, predicate question relates to the data model, i.e., for each sentence in the raw data, how to transform it to a data vector suitable for input to a...

machine-learning,cluster-analysis,pca,eigenvalue,eigenvector

As far as I can tell, you have mixed and shuffled a number of approaches. No wonder it doesn't work... You could simply use Jaccard distance (a simple inversion of Jaccard similarity) + hierarchical clustering; you could do MDS to project your data, then k-means (probably what you are trying...

You can try to convert it to binary features, one for each value of such a nominal attribute, e.g. has_A, has_B, has_C. Then if you scale it, i1 and i3 will be closer, as the mean for that attribute will be above 0.5 (referring to your example), and i2 will stand out more. If...
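
A minimal Python sketch of the conversion. The records r1 to r3 and the category list are hypothetical stand-ins, not the data from the question.

```python
def to_binary_features(values, categories):
    # One 0/1 column per category: has_A, has_B, ...
    present = set(values)
    return {f"has_{c}": int(c in present) for c in categories}

categories = ["A", "B", "C"]
# Hypothetical records, each holding a set-valued nominal attribute
records = {"r1": ["A", "B"], "r2": ["C"], "r3": ["A"]}

encoded = {name: to_binary_features(vals, categories)
           for name, vals in records.items()}
for name, row in encoded.items():
    print(name, row)
```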

cluster-analysis,mahout,hierarchical-clustering

I don't know how to answer "how to scale hclust up?" for your dataset. Throw more hardware/RAM at the problem, and/or search for a clever distributed implementation (Spark MLLib 1.4 does not implement Hierarchical Clustering, though). Your question is slightly confusing, read on why I think so. I don't understand...

python,cluster-analysis,dbscan

Your expandCluster function forgets the new neighbors. Your set update is swapped....

Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers"). For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix...

As cmdscale needs distances, try cmdscale(dist(points), eig = TRUE, k = 2). Symbols do not appear because of type = "n". For coloring text, use: text(x, y, rownames(points), cex = 0.6, col = centroids)

algorithm,matlab,cluster-analysis,vectorization

here's my oneliner: A= reshape(sum(((kron(x,ones(m,1))-repmat(x,m,1)).^2),2),[m m]); ...

algorithm,data-structures,cluster-analysis

According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are: a combination of k-means and agglomerative clustering (bottom-up); topic modeling; co-clustering. But I can't tell how to apply these on your dataset. It's big...

You can use bsxfun to compare the columns C=squeeze(sum(bsxfun(@eq, A,permute(A, [1 3 2])))); C(logical(eye(size(C))))=1; C = C/3; You first permute your 3x5 matrix to a 3x1x5 matrix, then using bsxfun(@eq, ... you compare columns of the original matrix, and then sum up the locations where the elements on two different...

r,performance,matrix,cluster-analysis,sparse-matrix

I've written some Rcpp code and R code which works out the binary/Jaccard distance of a binary matrix approx. 80x faster than dist(x, method = "binary"). It converts the input matrix into a raw matrix which is the transpose of the input (so that the bit patterns are in the...
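
The bit-pattern idea itself can be illustrated in plain Python, using integers as bitsets. This is only a sketch of the technique, not the Rcpp code from the answer. The rows are hypothetical, and returning 0 for two all-zero rows is a convention chosen here.

```python
def pack_bits(row):
    # Pack a 0/1 row into a single integer, one bit per column
    value = 0
    for bit in row:
        value = (value << 1) | (1 if bit else 0)
    return value

def jaccard_distance_bits(a, b):
    # 1 - |intersection| / |union|, counted with bitwise AND / OR
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two all-zero rows
    return 1.0 - bin(a & b).count("1") / union

rows = [[0, 1, 1, 1], [1, 0, 1, 0], [0, 1, 1, 1]]  # hypothetical binary matrix
packed = [pack_bits(r) for r in rows]

d01 = jaccard_distance_bits(packed[0], packed[1])
d02 = jaccard_distance_bits(packed[0], packed[2])
print(d01, d02)
```

The speed-up in compiled code comes from the same trick: whole machine words are compared at once instead of one matrix cell at a time.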

TF-IDF is defined on two documents or sentences (two word vectors). You seem to want to apply it one-vs-many? There is no theoretically backed approach for this. But you can use any of the heuristics common in HAC, because that appears to be what you are reinventing....

python,cluster-analysis,igraph

Why do you think the optimal cluster count should be 3? It seems to me that all the nodes have fairly strong connections to each other (they have a weight of 50), except two small groups where the connections are weaker. Note that clustering methods in igraph expect the weights...

java,cluster-analysis,elki,clique

Note that CLIQUE produces overlapping clusters. Elements can be in 0 to many clusters at the same time. If you choose your parameters badly (and CLIQUE parameters seem to be really hard to choose), you will get weird results. In your case, it seems to be 11 clusters, despite your...

It happens that you capture only the cluster element of the return value of kmeans, which returns also the centers of the clusters. Try this: #generate some data traindata<-matrix(rnorm(400),ncol=2) traindata=scale(traindata,center = T,scale=T) # Feature Scaling #get the full kmeans km.cluster = kmeans(traindata, 2,iter.max=20,nstart=25) #define a (euclidean) distance function between two...

r,cluster-analysis,data-mining,k-means

You already have the information you want: kmeansRes<-factor(km$cluster) just add it to your data frame as additional column. df1$cluster <- km$cluster ...

If you set the keep text check box on the Process Documents operator, the original text will accompany the attributes corresponding to the tokenized text.

The problem has to do with a name collision. The clustering toolbox has a Kmeans function. However, the MATLAB statistics toolbox has its own kmeans function. It could simply be that the clustering toolbox directories are lower in your path than the MATLAB builtin ones. So the first thing to...

matlab,time-series,cluster-analysis

It's hard to tell without knowing what dtw2 is. But here's one possible cause: When you call pdist2 with a custom distance function, it should meet the following condition: A distance function must be of the form function D2 = DISTFUN(ZI,ZJ) taking as arguments a 1-by-N vector ZI containing a...

python,machine-learning,cluster-analysis,k-means

As far as I know, clustering becomes very memory intensive as the size increases, so you will have to figure out a way to reduce the dimensionality of your data. I am not familiar with ROCK but I've worked on clustering problems before wherein I had to cluster millions of documents. Distance...

r,csv,amazon-ec2,cluster-analysis

Use algorithms that do not require O(n^2) memory, if you have a lot of data. Swapping to disk will kill performance, so this is not a sensible option. Instead, try either to reduce your data set size, or use index acceleration to avoid the O(n^2) memory cost. (And it's not only...

python-2.7,machine-learning,cluster-analysis,data-mining,k-means

Interpreted Python code is slow. Really slow. That is why the good python toolkits contain plenty of Cython code and even C and Fortran code (e.g. matrix operations in numpy), and only use Python for driving the overall process. You may be able to speed up your code substantially if...

opencv,machine-learning,computer-vision,histogram,cluster-analysis

In order to assign the descriptors to clusters, you have to choose a distance metric. A simple choice would be the Euclidean distance. Then you need to compute the distance from the training descriptors to each cluster centroid, and assign them to the cluster whose centroid is closer to the...
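
A minimal Python sketch of that assignment step with Euclidean distance. Counting the assignments into a per-cluster histogram is an assumption about the next step, and the 2D "descriptors" are hypothetical toy values (real descriptors would have many more dimensions).

```python
import math
from collections import Counter

def nearest_centroid(descriptor, centroids):
    # Index of the centroid with the smallest Euclidean distance
    return min(range(len(centroids)),
               key=lambda j: math.dist(descriptor, centroids[j]))

def bow_histogram(descriptors, centroids):
    # Count how many descriptors fall into each cluster
    counts = Counter(nearest_centroid(d, centroids) for d in descriptors)
    return [counts.get(j, 0) for j in range(len(centroids))]

centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
descriptors = [(0.5, 0.2), (4.0, 6.0), (5.5, 4.5), (9.0, 1.0), (0.1, -0.3)]

hist = bow_histogram(descriptors, centroids)
print(hist)
```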

ruby,machine-learning,cluster-analysis,k-means,hierarchical-clustering

k-means is the wrong tool for this job anyway. It's not designed for fitting an exponential curve. Here is a much more sound proposal for you: Look at the plot, mark the three points, and then you have your three groups. Or look at quantiles... Report the median response time,...

matlab,machine-learning,cluster-analysis,sift,feature-extraction

With SIFT you usually don't compute the feature for the whole image, but literally hundreds of keypoints for each image. Then you won't have this problem anymore. Next step then probably is a “bag of visual words” mapping of each image....

r,cluster-analysis,hierarchical-clustering,hclust

You can use hclust models <- apply(combinations, 1, function(x) hclust(dist(dat[,x]))) clusters <- apply(combinations, 1, function(x) cutree(hclust(dist(dat[,x])), k = 3)) ...

java,cluster-analysis,data-mining,elki,optics-algorithm

We currently do not provide ELKI on Maven. Thus, there currently is no Maven dependency. ELKI is changing quickly, and we do not provide a stable API. For example, in the next release NumberVector<? extends Number> will simplify to just NumberVector. Getting rid of this generic is nice, but it...

machine-learning,cluster-analysis,weka

You should drop the class attribute before you do clustering. It has too much predictive power, and as a consequence of this, the clustering algorithm has a strong bias to prefer the class attribute internally. You can do this attribute removal in the "Preprocess" panel by clicking the "remove" button,...

java,cluster-analysis,dbscan,apache-commons-math

Try the version in ELKI instead. Apache commons math is unfortunately not very good. I moved away from commons-math because of various small issues. ELKI works much better for me. From a quick look, commons-math is still pretty dead when it comes to cluster analysis... it was last touched for...

graph,cluster-analysis,partitioning,image-segmentation

I have worked on this topic as part of my PhD research. You can find a few optimization algorithms here. You might find that combining these optimization algorithms with a multiscale framework can help both speed up the run time and improve the convergence of the optimization process.

java,cluster-analysis,dbscan,elki

ELKI has a modular architecture. If you want your own data source, look at the datasource package, and implement the DatabaseConnection (JavaDoc) interface. If you want to process MyObject objects (the class you shared above will likely come at a substantial performance impact), that is not particularly hard. You need...

matlab,cluster-analysis,k-means

OK, now your question is more clear. I didn't understand what you meant in your other post. Alright, it looks like your T is your transmission alphabet. Be advised that the clusters that you get through k-means will probably not be the same as those from your transmission alphabet, so...

This is not cluster analysis. It's not about finding structural components in your data set. Instead, what you are looking for is a JOIN of the two data sets. If you allow each point to map to multiple points in the other data set, then it is a nearest neighbor...

python,python-2.7,scikit-learn,cluster-analysis,k-means

This is a simpler example: from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.cluster import KMeans from sklearn.metrics import adjusted_rand_score documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of...

I'm not sure why but the silhouette call seems to drop the row names. You can add them back with cl <- hclust(as.dist(tmp,diag = TRUE, upper = TRUE), method= 'single') sil_cl <- silhouette(cutree(cl, h=25) ,as.dist(tmp), title=title(main = 'Good')) rownames(sil_cl) <- rownames(tmp) plot(sil_cl) ...

What you are asking for can be solved in multiple ways. Here are two: The first way is to simply define the separating line of your clusters. Since you know how your points should be grouped (by a line) you can use that. If you want your line to start at...

matlab,opencv,cluster-analysis,point-clouds

This example looks like the very scenario DBSCAN was designed for. Lots of noise, but with a well understood density, and clusters with much higher density but arbitrary shape.

algorithm,cluster-analysis,bootstrapping,invariants,stability

You can find here a survey of clustering algorithms invariant under permutation in the data points and invariant under monotone transformation of similarity values: Batyrshin I., Rudas T. Invariant Hierarchical Clustering Schemes. In: Batyrshin, I.; Kacprzyk, J.; Sheremetov, L.; Zadeh, L.A. (Eds.). Perception-based Data Mining and Decision Making in Economics...

algorithm,cluster-analysis,k-means,hierarchical-clustering

You could try the DBSCAN density-based clustering algorithm, which is O(n log n) (guaranteed ONLY when using an indexing data structure like a kd-tree, ball-tree etc.; otherwise it's O(n^2)). Events that are part of a bigger event should be in areas of high density. It seems to be a great...

r,plot,cluster-analysis,dendrogram,hclust

If you want to mimic the last few graphs, you can do something like this: N <- max(G) layout(matrix(c(0,1:N,0),nc=1)) gdist <- as.matrix(gdist) for (i in 1:N){ par(mar=c(0,3,0,7)) rc_tokeep <- row.names(subset(d, G==i)) if(length(rc_tokeep)>2){ #The idea is to catch the groups with one single element to plot them differently dis <- as.dist(gdist[rc_tokeep,...

Currently, the possibility to boost arbitrary words is not exposed in the API, so only the words included in the title can be promoted. The code that does the boosting is in: https://github.com/carrot2/carrot2/blob/master/core/carrot2-util-text/src/org/carrot2/text/vsm/TermDocumentMatrixBuilder.java#L159 You could add another attribute that would, for example, take a comma-separated list of words and boost...

csv,cluster-analysis,openrefine,data-cleaning

Did you try using the facet function? Facets group records based on an exact match. You can watch these videos on faceting and data profiling.

r,machine-learning,cluster-analysis,som,unsupervised-learning

Map 1 is the average vector result for each node. The top 2 nodes that you highlighted are very similar. Map 2 is a kind of similarity index between the nodes. If you want to obtain this kind of map using the map 1 result you may have to develop...

The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to. Here's a simple example: d <- data.frame(x=runif(100), cluster=NA) d$x[sample(100, 10)] <- NA clus <-...

r,vector,matrix,cluster-analysis,frequency

You can match V1, V2 etc against the unique levels then tabulate the results. uKmers <- levels(as.factor(arrayKmers)) freqKmers <- apply(arrayKmers, 2, function(x){ tabulate(match(x, uKmers), length(uKmers)) } ) > t(freqKmers) [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [1,] 0 1 0 0 1 1 0 1 [2,] 0 1 0...

machine-learning,cluster-analysis,k-means,stocks

Yes, it involves clustering. For the input, collect each company's profit, loss, customer feedback, product demand, years in business, etc.,...

cluster-analysis,weka,data-mining,k-means,rapidminer

Weka often uses built-in normalization at least in k-means and other algorithms. Make sure you have disabled this if you want to make results comparable. Also understand that k-means is a randomized algorithm. Different results even from the same package are to be expected (and desirable)....

To get a minimal (unfortunately not minimum) solution: First, greedily recluster any points that you can without violating the minimum size constraint. Next, build a directed multigraph as follows: Every cluster becomes a node. An edge (A,B) is added for every point in A that is closer to the center...

machine-learning,cluster-analysis,knn

You can try Meka which is based on Weka and is expected to handle multilabel classification problems.

cluster-analysis,data-mining,gaussian,elki

The delta parameter in EM is necessary to detect convergence. Since EM uses soft assignments internally, it will continue updating the values to arbitrary digits (technically, it will eventually run out of precision, and stop). As long as you choose a small enough value, you should be fine. However, EM...

r,plot,cluster-analysis,k-means

If myData is your data set, you can identify each group using a kmeans algorithm: (Make sure x and y are centered and normalized accordingly before) myData <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(myData) <- c("x", "y") (cl...

scikit-learn,cluster-analysis,data-mining,predict,dbscan

Clustering is not classification. Clustering is unlabeled. If you want to squeeze it into a prediction mindset (which is not the best idea), then it essentially predicts without learning. Because there is no labeled training data available for clustering. It has to make up new labels for the data, based...

python,scikit-learn,cluster-analysis,k-means

You can first calculate all the centroids using K-Means. Then compute the Euclidean distance (available in sklearn.metrics) from every point to all the centroids (except those you want to exclude). Finally, get the cluster that minimizes the distance (np.argmin along the 2nd axis) for each point.
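
The same three steps, sketched in plain Python as a stand-in for the sklearn/NumPy calls. The centroids are assumed to come from an earlier k-means run; the values and the function name are hypothetical.

```python
import math

def assign_excluding(points, centroids, excluded):
    # Keep only the centroid indices that are still allowed
    banned = set(excluded)
    allowed = [j for j in range(len(centroids)) if j not in banned]
    if not allowed:
        raise ValueError("at least one centroid must remain")
    # For each point, pick the allowed centroid with the smallest distance
    return [min(allowed, key=lambda j: math.dist(p, centroids[j]))
            for p in points]

centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]  # e.g. from a k-means fit
points = [(1.0, 0.0), (6.0, 4.0), (9.0, 1.0)]

labels = assign_excluding(points, centroids, excluded=[2])
print(labels)  # the last point falls back to its second-closest centroid
```

The returned labels keep the original centroid indices, so they stay comparable with the labels of the full model.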

cluster-analysis,data-mining,elki

ELKI is mostly used with numerical data. Currently, ELKI does not have a "mixed" data type, unfortunately. The ARFF parser will split your data set into multiple relations: a 1-dimensional numerical relation containing age; a LabelList relation storing sex and region; a 1-dimensional numerical relation containing salary; a LabelList relation...

c++,opencv,cluster-analysis,k-means,feature-extraction

The result should always be [n x 15] if you have n images and k=15. But in the first run, you looked at the vocabulary, not the feature representation of the first images. The 128 you are seeing there is the SIFT dimensionality; these are 15 "typical" SIFT vectors; they...

python-2.7,machine-learning,cluster-analysis,data-mining,k-means

The simplest way out of this is to add a column-specific prefix to all entries. Example of a parsed row: row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"] There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based...
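
A short Python sketch of why the prefix helps, using hypothetical two-column rows and a plain Jaccard similarity on sets. Without the prefix, the same value appearing in different columns looks like a match.

```python
def prefix_row(columns, row):
    # Turn each value into a "column:value" token
    return {f"{col}:{val}" for col, val in zip(columns, row)}

def jaccard_similarity(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

columns = ["AGE", "Product_range"]
row_1 = ["25-34", "50-60"]  # hypothetical rows with colliding values
row_2 = ["50-60", "25-34"]

sim_raw = jaccard_similarity(set(row_1), set(row_2))
sim_prefixed = jaccard_similarity(prefix_row(columns, row_1),
                                  prefix_row(columns, row_2))
print(sim_raw, sim_prefixed)
```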

machine-learning,cluster-analysis,k-means,unsupervised-learning

The k-means algorithm requires some initialization of the centroid positions. For most algorithms, these centroids are randomly initialized with some method such as the Forgy method or random partitioning, which means that repeated iterations of the algorithm can converge to vastly different results. Remember that k-means is iterative, and at...

Usually, one would start with 3 objects sampled from the actual data. This usually works much better than uniformly generated data. If you place 1 of the three centroids far away from your data, it may end up becoming empty, and your approach degenerates to 2-means. And I can always...
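
Sampling the initial centroids from the data takes only a few lines of Python. The data points, the seed, and the function name below are hypothetical.

```python
import random

def init_centroids_from_data(points, k, seed=None):
    # Draw k distinct data points to serve as the initial centroids
    rng = random.Random(seed)
    return rng.sample(points, k)

points = [(0.1, 0.2), (0.3, 0.1), (5.0, 5.2), (5.1, 4.9), (9.8, 0.1), (10.1, -0.2)]
initial = init_centroids_from_data(points, k=3, seed=0)
print(initial)
```

Because every initial centroid coincides with a real object, no cluster starts out empty.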

cluster-analysis,data-mining,dbscan,elki

If you use the -resulthandler ResultWriter and output to text, the cluster sizes will be at the top of each cluster file. The visualizer currently doesn't seem to show cluster sizes....

cluster-analysis,k-means,rapidminer

What you describe is not clustering, but feature (word) selection for "cloud tag" classification. Have a look at decision trees, and the metrics used there to identify good features for splitting....

cluster-analysis,data-mining,weka,dbscan,elki

Accessing the DBIDs of ELKI works, if you pay attention to how they are assigned. For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be...

matlab,matrix,cluster-analysis,k-means

If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here. How k-means works is that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and...

cluster-analysis,data-mining,dbscan

Yes, if used with Euclidean distance, then it is a radius. It is not infinitely small (it does not tend to 0). It's just supposed to be small compared to the data set extent, but the authors could have named it "r" instead. Use the original paper to understand the...

scikit-learn,cluster-analysis,logistic-regression

I would recommend clustering the log of the odds ratios for each of the explanatory variables. This way, for the models that don't have certain regressors, you can fill in the empty values with 0.0 (this can be done quite easily with pandas). Assume you have a list of all the models...

This works: require(cluster) require(class) data(iris) ds <- iris ds$y <- as.numeric(ds$Species) ds$Species <- NULL idx <- rbinom(nrow(ds), 2, .6) training <- ds[idx,] testing <- ds[-idx,] x <- training y <- training$y x1 <- testing y1 <- testing$y clus <- clara(x, 3, samples = 1, sampsize = nrow(x), pamLike=TRUE) knn(train =...

python-2.7,cluster-analysis,hierarchical-clustering,outliers,dbscan

If finding the appropriate value of epsilon is a major problem, the real problem may be long before that: you may be using the wrong distance measure all the way, or you may have a preprocessing problem. Your code looks a lot like a naive preprocessing approach - and that...