KMeans.scala : package com.driver import com.driver.Functions._ object KMeans extends Application { case class Point(label: String, points: List[Double]) val k = 3 val points = List(("A1,2,10"), ("A2,2,5"), ("A3,8,4"), ("A4,5,8"), ("A5,7,5"), ("A6,6,4"), ("A7,1,2"), ("A8,4,9")) val toDouble: List[Point] = points.map(m => new Point(m.split(",").head, m.split(",").tail.map(m2 => m2.toDouble).toList)).toList var initialCenters: List[Point] = util.Random.shuffle(toDouble).take(k) val maxNumberOfIterations...

r,cluster-analysis,data-mining,k-means

You already have the information you want: kmeansRes<-factor(km$cluster). Just add it to your data frame as an additional column: df1$cluster <- km$cluster ...

It's not clear at all from your question what you want to do with kmeans. For instance, how many clusters are you looking for? I recommend looking at the first example in the MATLAB ref guide. For your data you can try e.g. X = [xtemp(:) ytemp(:)]; Nclusters = 3;...

python,numpy,scipy,scikit-learn,k-means

When you use Bag of Words, each of your sentences gets represented in a high dimensional space of length equal to the vocabulary. If you want to represent this in 2D you need to reduce the dimension, for example using PCA with two components: from sklearn.datasets import fetch_20newsgroups from sklearn.feature_extraction.text...

If you give it a mismatching init it will adjust the number of clusters, as you can see from the source. This is not documented and I would consider it a bug. I'll propose to fix it.

image-processing,matrix,computer-vision,cluster-analysis,k-means

Cluster analysis is not the tool that you are looking for. Your requirements may be more of image processing nature than clustering. Clustering doesn't operate "visually" on matrixes. One cell is pretty much like another in clustering, and one mostly uses matrixes because there are fast implementations available, and you...

solr,cluster-analysis,k-means,workbench,carrot

Your suspicion is correct, it is a heap size problem, or more precisely, a scalability constraint. Straight from the carrot2 FAQs: http://project.carrot2.org/faq.html#scalability How does Carrot2 clustering scale with respect to the number and length of documents? The most important characteristic of Carrot2 algorithms to keep in mind is that they...

Trivial Training-validation-test Create two datasets from your labelled instances. One will be training set and the other will be validation set. The training set will contain about 60% of the labelled data and the validation will contain 40% of the labelled data. There is no hard and fast rule for...
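
The 60/40 split described above could look roughly like the following. This is only a sketch in plain Python: the function name split_train_validation, the seed and the toy labelled list are made up for illustration and are not part of the original answer.

```python
import random

def split_train_validation(instances, train_fraction=0.6, seed=0):
    """Shuffle a copy of the labelled instances and cut it into two parts."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(instances)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical labelled data: (feature, label) pairs.
labelled = [(i, "pos" if i % 2 else "neg") for i in range(10)]
train, validation = split_train_validation(labelled)
print(len(train), len(validation))
```

Shuffling before cutting matters when the labelled data is sorted by class or by time; otherwise one of the two sets may end up with only one class.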

This is the example from example(kmeans): # This is just to generate example data test <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(test) <- c("V1", "V2") #store the kmeans in a variable called cl (cl <- kmeans(test, 2)) #...

machine-learning,cluster-analysis,k-means,stocks

Yes, it involves clustering. For the input companies, collect profit, loss, customer feedback, demand for the product, years in business, etc.,...

python,scikit-learn,cluster-analysis,k-means

You can first calculate all the centroids using K-Means. Then compute euclidean distance from sklearn.metrics from every point to all the centroids (except those you want to exclude). Finally, get the cluster that minimizes the distance (np.argmin along 2nd axis) for each point.
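
As a rough illustration of the steps above, here is a plain-Python sketch. It does not use sklearn or np.argmin; the hard-coded centroids stand in for the centers a fitted K-Means model would give you, and the function name nearest_allowed_centroid is made up.

```python
import math

def nearest_allowed_centroid(points, centroids, excluded=()):
    """For each point, return the index of the closest non-excluded centroid."""
    banned = set(excluded)
    allowed = [i for i in range(len(centroids)) if i not in banned]
    if not allowed:
        raise ValueError("every centroid was excluded")
    labels = []
    for p in points:
        # argmin over the allowed centroids only
        labels.append(min(allowed, key=lambda i: math.dist(p, centroids[i])))
    return labels

# Stand-in for the centers of an already fitted K-Means model.
centroids = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
points = [(1.0, 1.0), (6.0, 6.0), (9.0, 9.0)]

all_labels = nearest_allowed_centroid(points, centroids)
without_2 = nearest_allowed_centroid(points, centroids, excluded={2})
print(all_labels, without_2)
```

The returned indices refer to the original centroid list, so the labels stay comparable with the ones K-Means produced.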

machine-learning,data-mining,k-means

k-means is not a good choice, as it will not handle the 180° wrap-around, and distances anywhere but the equator will be distorted. IIRC in northern USA and most parts of Europe, the distortion is over 20% already. Similarly, it does not make sense to use k-means on binary data...
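
To make the wrap-around and the distortion concrete, here is a small sketch comparing the great-circle (haversine) distance with the naive Euclidean distance on raw degrees. The function names and the sample coordinates are chosen for illustration, and the Earth is treated as a sphere of mean radius 6371 km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def naive_degree_distance(lat1, lon1, lat2, lon2):
    """What a Euclidean method sees when fed raw degrees."""
    return math.hypot(lat2 - lat1, lon2 - lon1)

# Two points one degree apart across the 180° meridian.
wrap_great_circle = haversine_km(0.0, 179.5, 0.0, -179.5)
wrap_naive = naive_degree_distance(0.0, 179.5, 0.0, -179.5)

# One degree of longitude at the equator versus at 60° latitude.
equator_km = haversine_km(0.0, 0.0, 0.0, 1.0)
lat60_km = haversine_km(60.0, 0.0, 60.0, 1.0)
print(wrap_great_circle, wrap_naive, equator_km, lat60_km)
```

The naive distance treats the two points at the date line as 359 degrees apart although they are neighbours, and it counts a degree of longitude the same everywhere although it shrinks towards the poles.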

python,python-2.7,scikit-learn,cluster-analysis,k-means

This is a simpler example: from sklearn.feature_extraction.text import TfidfVectorizer from sklearn.cluster import KMeans from sklearn.metrics import adjusted_rand_score documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", "System and human system engineering testing of...

First Question (are the centers part of my data?): No the centroids are not members of your data. They are randomly generated within the data set. It might happen that a centroid falls on a data point but that will be a coincidence and also the centroid will still...

cluster-analysis,data-mining,k-means,hierarchical-clustering,dbscan

Pearson correlation is not compatible with the mean. Thus, k-means must not be used - it is proper for least-squares, but not for correlation. Instead, just use hierarchical agglomerative clustering, which will work with Pearson correlation matrixes just fine. Or DBSCAN: it also works with arbitrary distance functions. You can...

ruby,machine-learning,cluster-analysis,k-means,hierarchical-clustering

k-means is the wrong tool for this job anyway. It's not designed for fitting an exponential curve. Here is a much more sound proposal for you: Look at the plot, mark the three points, and then you have your three groups. Or look at quantiles... Report the median response time,...

r,plot,cluster-analysis,k-means

If myData is your data set, you can identify each group using a kmeans algorithm: (Make sure x and y are centered and normalized accordingly beforehand) myData <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(myData) <- c("x", "y") (cl...

matlab,opencv,image-processing,computer-vision,k-means

After inspecting the images, I decided that a robust threshold will be more simple than anything. For example, looking at the C=40%, M=40% photo, I first inverted the intensities so black (the signal) will be white just using im=(abs(255-im)); we can inspect its RGB histograms using this : hist(reshape(single(im),[],3),min(single(im(:))):max(single(im(:)))); colormap([1...

You can apply a method to intelligently select initial centroids rather than selecting them randomly. There are papers on improved K-Means algorithms. You can refer to one or more of them and create your own improved K-Means algorithm....

java,exception,weka,data-mining,k-means

According to the Attribute-Relation File Format (ARFF): Lines that begin with a % are comments. (http://www.cs.waikato.ac.nz/ml/weka/arff.html) So given your moviedata.arff @data section, that could explain why there are no training instances being read in. In other words, when the exception says "Not enough training instances (required: 1, provided: 0)", it...

One basic metric to measure how "good" your clustering is in comparison to your known class labels is called purity. Now this is an example of supervised learning where you have some idea of an external metric that is a labeling of instances based on real world data. The mathematical definition...
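
For reference, purity is usually computed by taking, in each cluster, the count of its most frequent true class, summing these counts over all clusters and dividing by the number of instances. A minimal sketch under that standard definition follows; the function name and the toy labels are made up.

```python
from collections import Counter

def purity(cluster_labels, class_labels):
    """Fraction of instances that belong to the majority class of their cluster."""
    if len(cluster_labels) != len(class_labels) or not cluster_labels:
        raise ValueError("need two non-empty label lists of equal length")
    per_cluster = {}
    for cluster, klass in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[klass] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_labels)

# Toy example: 8 instances, 3 clusters, 3 known classes.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
classes = ["a", "a", "b", "b", "b", "c", "c", "c"]
score = purity(clusters, classes)
print(score)
```

Keep in mind that purity rises trivially with the number of clusters (one cluster per instance gives 1.0), so it is only meaningful when comparing clusterings with the same k.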

Try to use tryCatch to automate the process of stopping when convergence is reached: I use the iris data set because there kmeans needs 2 iterations (the (6,3.5)-Point switches) set.seed(1337) df = iris[,1:2] dfCluster<-kmeans(df,centers=3, iter.max = 1) plot(df[,1], df[,2], col=dfCluster$cluster,pch=19,cex=2, main="iter 1") points(dfCluster$centers,col=1:5,pch=3,cex=3,lwd=3) max_iter = 10 for (i in 2:max_iter){...

cluster-analysis,weka,data-mining,k-means,rapidminer

Weka often uses built-in normalization at least in k-means and other algorithms. Make sure you have disabled this if you want to make results comparable. Also understand that k-means is a randomized algorithm. Different results even from the same package are to be expected (and desirable)....

python,opencv,k-means,quantization

Change: ret,label,center=cv2.kmeans(Z,K,None,criteria,10,cv2.KMEANS_RANDOM_CENTERS) to: ret,label,center=cv2.kmeans(Z,K,criteria,10,cv2.KMEANS_RANDOM_CENTERS) (Deleted None)...

Your basic setup seems correct, especially given that you are getting 85-95% accuracy. Now, it's just a matter of tuning your procedure. Unfortunately, there is no way to do this other than testing a variety of parameters, examining the results and repeating. I'm going to break this answer into two...

It happens that you capture only the cluster element of the return value of kmeans, which returns also the centers of the clusters. Try this: #generate some data traindata<-matrix(rnorm(400),ncol=2) traindata=scale(traindata,center = T,scale=T) # Feature Scaling #get the full kmeans km.cluster = kmeans(traindata, 2,iter.max=20,nstart=25) #define a (euclidean) distance function between two...

It looks like you have two or more lists on a line somewhere meaning you're trying to evaluate two or more arrays (a sequence) as a single array. When I test this with two arrays separated by a comma then I get the same error as you. Try this to...

KMeans().predict(X) ..docs here Predict the closest cluster each sample in X belongs to. In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book. Parameters: (New data to predict) X : {array-like, sparse...

Your problem is that the document term matrix dtm should be indexed as a 2D matrix with the row and col. It should be findFreqTerms(dtm[cl$cluster==1,], 50) (notice the extra comma, we leave the second parameter empty so all columns are returned)...

Once you have a per-cluster array, recompute the "cluster center" as the average of the items in that cluster. Then, reassign every item to the appropriate cluster (the one whose "center" it's closest to after the recomputation). You're done when no item changes cluster (it's theoretically possible to end up...
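
The loop described above (recompute centers, reassign, stop when nothing moves) can be sketched in plain Python as follows. The helper names, the toy points and the choice to leave the center of an empty cluster where it was are assumptions for this example, not something the answer prescribes.

```python
import math

def mean(points):
    """Component-wise average of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def assign(points, centers):
    """Index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def kmeans(points, centers, max_iter=100):
    centers = list(centers)
    labels = assign(points, centers)
    for _ in range(max_iter):
        # Recompute each center as the average of its current members.
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = mean(members)
        # Reassign every item; stop when no item changes cluster.
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[1]])
print(labels, centers)
```

The max_iter cap is only a safety net; with these six points the loop stops after two updates.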

python,machine-learning,scikit-learn,k-means

This will show how you can retrieve the cluster id for each row and the cluster centers. I have also measured the distance from each row to each centroid so you can see that the rows are properly assigned to the clusters. In [1]: import pandas as pd from sklearn.cluster...

Usually, one would start with 3 objects sampled from the actual data. This usually works much better than uniformly generated data. If you place 1 of the three centroids far away from your data, it may end up becoming empty, and your approach degenerates to 2-means. And I can always...

You could try a hierarchical clustering using a binary distance measure like jaccard, if "clicked a link" is asymmetrical: dat <- read.table(header = TRUE, row.names = 1, text = "User link1 link2 link3 link4 abc1 0 1 1 1 abc2 1 0 1 0 abc3 0 1 1 1 abc4...

r,statistics,cluster-analysis,k-means

The amount of variance explained is related to the two principal components calculated to visualize your data. This has nothing to do with the type of clustering algorithm or the accuracy of the algorithm that you're using (kmeans in this case). To understand how accurate your clustering algorithm is at...

postgresql,indexing,postgis,k-means

Per John Barça's suggestion, I used ST_DWithin with a GIST index on my geometries and timestamps and reduced the runtime to less than 10ms for the same query posted above. The only tricky piece was realizing I needed degrees instead of meters for the geometry calculation (geographies can use meters). This...

java,filter,k-means,flink,apache-flink

I think the problem is the filter function (modulo the code you haven't posted). Flink's termination criterion works the following way: The termination criterion is met if the provided termination DataSet is empty. Otherwise the next iteration is started if the maximum number of iterations has not been exceeded. Flink's...

matlab,cluster-analysis,k-means

OK, now your question is more clear. I didn't understand what you meant in your other post. Alright, it looks like your T is your transmission alphabet. Be advised that the clusters that you get through k-means will probably not be the same as those from your transmission alphabet, so...

python-2.7,opencv,scipy,k-means,sift

Ok, apparently the problem is solved by changing the code a little. Instead of the list comprehension I did: descriptors = np.array([]) for pic in train: kp, des = cv2.SIFT().detectAndCompute(pic, None) descriptors = np.append(descriptors, des) desc = np.reshape(descriptors, (len(descriptors)/128, 128)) desc = np.float32(desc) which works with the cv2...

You need to 'normalize' your data before calling kmeans. See for instance in the following code I have on purpose applied a scaling so that both income and age have similar range ageincome2=ageincome ageincome2[,1]=scale(ageincome2[,1]) ageincome2[,2]=scale(ageincome2[,2]) center=4 kk <- kmeans(ageincome2, center=center) plot(ageincome2, col = kk$cluster) points(kk$centers, col = 1:center, pch =...
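
For readers who do not use R, the same per-column scaling can be sketched in plain Python. Like R's scale(), it subtracts the column mean and divides by the sample standard deviation; the function name and the age/income rows are invented for this example.

```python
import statistics

def scale_columns(rows):
    """Z-score every column: subtract its mean, divide by its sample sd."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # fails on constant columns
    return [tuple((v - m) / s for v, m, s in zip(row, means, sds)) for row in rows]

# Hypothetical (age, income) rows: income would dominate any raw distance.
age_income = [(25, 30000.0), (35, 52000.0), (45, 61000.0), (55, 90000.0)]
scaled = scale_columns(age_income)
print(scaled)
```

After scaling, a ten-year age gap and a typical income gap contribute on a comparable scale to the Euclidean distance that k-means minimizes.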

python,numpy,scipy,scikit-learn,k-means

The following code should work. Let me know if this is not the case. import numpy as np from sklearn.decomposition import PCA from sklearn.cluster import KMeans search_terms = ['computer','usb port', 'phone adaptor'] clicks = [3,2,1] bounce = [0,0,2] conversion = [4,1,0] X = np.array([clicks, bounce, conversion]).T y = np.array(search_terms) num_clusters...

r,machine-learning,nlp,cluster-analysis,k-means

You could use the stringdist package to calculate the distance matrix: str <- c("patio furniture", "living room furniture", "used chairs", "new chairs") library(stringdist) d <- stringdistmatrix(str, str) stringdist supports a number of distance functions. The default is the 'restricted Damerau-Levenshtein distance'. You can then use this distance matrix in hclust...

python,numpy,matplotlib,scipy,k-means

The color= or c= property should be a matplotlib color, as mentioned in the documentation for plot. To map an integer label to a color just do LABEL_COLOR_MAP = {0 : 'r', 1 : 'k', ...., } label_color = [LABEL_COLOR_MAP[l] for l in labels] plt.scatter(x, y, c=label_color) If you don't...

hadoop,cluster-analysis,data-mining,mahout,k-means

How big is your data? If it is not exabytes, you would be better off without Mahout. If it is exabytes, use sampling, and then process it on a single machine. See also: Cluster one-dimensional data optimally? 1D Number Array Clustering Which clustering algorithm is suitable for one-dimensional Lists without...

python,scikit-learn,data-mining,k-means

You have many samples of 1 feature, so you can reshape the array to (13876, 1) using numpy's reshape: from sklearn.cluster import KMeans import numpy as np x = np.random.random(13876) km = KMeans() km.fit(x.reshape(-1,1)) # -1 will be calculated to be 13876 here ...

algorithm,cluster-analysis,k-means,hierarchical-clustering

You could try the DBSCAN density-based clustering algorithm, which is O(n log n) (guaranteed ONLY when using an indexing data structure like a kd-tree, ball-tree etc., otherwise it's O(n^2)). Events that are part of a bigger event should be in areas of high density. It seems to be a great...

The output of kmeans corresponds to the elements of the object passed as argument x. In your case, you omit the NA elements, and so $cluster indicates the cluster that each element of na.omit(x) belongs to. Here's a simple example: d <- data.frame(x=runif(100), cluster=NA) d$x[sample(100, 10)] <- NA clus <-...

Your question isn't very clear and this might not be what you want, but I'll give it a try anyway. I think you have an array of RGB values and some centroids, maybe three, which are also RGB values. For each RGB value in the array you want to know...

python,numpy,scikit-learn,k-means

The default behavior of KMeans is to initialize the algorithm multiple times using different random centroids (i.e. the Forgy method). The number of random initializations is then controlled by the n_init= parameter: n_init : int, default: 10 Number of time the k-means algorithm will be run with different centroid seeds....
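
To illustrate what repeated initialization buys you, here is a small stand-alone sketch on 1-D data. It is not scikit-learn's implementation: it simply runs a basic k-means n_init times from different random data points and keeps the run with the lowest inertia (sum of squared distances to the assigned center). All names and values are made up.

```python
import random

def kmeans_1d(values, k, rng, max_iter=100):
    """One k-means run on 1-D data, seeded with k random data points."""
    centers = rng.sample(values, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda i: (v - centers[i]) ** 2)
                      for v in values]
        if new_labels == labels:   # converged: assignments stopped changing
            break
        labels = new_labels
        for i in range(k):
            members = [v for v, label in zip(values, labels) if label == i]
            if members:            # assumption: empty clusters keep their center
                centers[i] = sum(members) / len(members)
    inertia = sum((v - centers[label]) ** 2 for v, label in zip(values, labels))
    return inertia, centers

def best_of_n_init(values, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the lowest-inertia result."""
    rng = random.Random(seed)
    runs = [kmeans_1d(values, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0]), runs

values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
(best_inertia, best_centers), runs = best_of_n_init(values, k=3)
print(best_inertia, sorted(best_centers))
```

Individual runs can get stuck in a poor local minimum when two seeds land in the same group; taking the best of several runs makes that much less likely to affect the final result.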

Yes this can happen. It depends on your implementation what is then happening. Some leave the cluster center as is, some reduce k, some choose a new cluster center, some crash badly....

matlab,matrix,cluster-analysis,k-means

If you don't "understand" the cluster centroid results, then you don't understand how k-means works. I'll present a small summary here. How k-means works is that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and...

If you limit the maximum distance, you are doing hierarchical clustering (more precisely, a single cut through the cluster tree), not k-medoids. It's common to do this with a distance matrix, too....

cluster-analysis,k-means,rapidminer

You aren't describing clustering, but feature (word) selection for "cloud tag" classification. Have a look at decision trees, and the metrics used there to identify good features for splitting....

python-2.7,machine-learning,cluster-analysis,data-mining,k-means

The simplest way out of this is to add a column-specific prefix to all entries. Example of a parsed row: row = ["ID:1100", "AGE:25-34", "Occupation:IT", "Gender:M", "Product_range:50-60", "Product_cat:Gaming", "Product:XPS 6610"] There are many other ways around this, including splitting each row into a set of k-mers and applying the Jaccard-based...
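
Building the prefixed row shown above is a one-liner once you have the column names. A minimal sketch, assuming the header and the raw values below (they are reconstructed from the example row, not taken from the asker's file):

```python
def prefix_row(header, row):
    """Turn each value into 'column:value' so equal strings in
    different columns no longer collide."""
    return [f"{column}:{value}" for column, value in zip(header, row)]

header = ["ID", "AGE", "Occupation", "Gender",
          "Product_range", "Product_cat", "Product"]
raw = ["1100", "25-34", "IT", "M", "50-60", "Gaming", "XPS 6610"]

row = prefix_row(header, raw)
print(row)
```

Without the prefix, a value such as "50-60" would look identical whether it came from the age column or the product range column.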

scala,apache-spark,k-means,mllib

Previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by the .split(",") - and offered to satisfy the types, by using Array instead of Tuple2. The argument received by the .map/.flatMap functions is an element of the original...

python,initialization,scipy,k-means

You can seed the default numpy random number generator, for example: from numpy import random random.seed(123) as shown in the last example here (which seems to be applicable to kmeans2 as well)....

algorithm,machine-learning,k-means

This algorithm revolves around the idea that points which are close to one cluster center and further away from all other cluster centers will stick to that cluster center. Thus, in subsequent iterations these points don't have to be compared to all cluster centers. Imagine a point P which has...
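
The geometric fact behind this kind of shortcut is the triangle inequality: if the distance between the current center c and another center c' is at least twice the distance from P to c, then c' cannot be closer to P than c, so the distance from P to c' never has to be computed. The sketch below shows only that pruning test; the function name is made up, and a real implementation would precompute the center-to-center distances once per iteration instead of recomputing them per point.

```python
import math

def assign_with_pruning(point, centers, current):
    """Return (index of closest center, number of skipped distance computations)."""
    best = current
    best_dist = math.dist(point, centers[best])
    skipped = 0
    for j in range(len(centers)):
        if j == best:
            continue
        # d(best, j) >= 2 * d(point, best) implies d(point, j) >= d(point, best)
        if math.dist(centers[best], centers[j]) >= 2 * best_dist:
            skipped += 1
            continue
        d = math.dist(point, centers[j])
        if d < best_dist:
            best, best_dist = j, d
    return best, skipped

centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]

# A point that sits right next to its current center: both other centers are pruned.
close_result = assign_with_pruning((1.0, 1.0), centers, current=0)
# A point that has drifted: one real comparison is needed, the last one is pruned.
moved_result = assign_with_pruning((6.0, 0.0), centers, current=0)
print(close_result, moved_result)
```

Points deep inside a cluster are exactly the ones for which the test succeeds, which is why the savings grow as the clustering stabilizes.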

What you are asking for can be solved in multiple ways. Here are two: First way is to simply define the separating line of you clusters. Since you know how your points should be grouped (by a line) you can use that. If you want your line to start at...

excel,matlab,cluster-analysis,k-means,geo

I think you are looking for "path planning" rather than clustering. The traveling salesman problem comes to mind. If you want to use clustering to find the individual regions you should find the coordinates for each location with respect to some global frame. One example would be using Latitude and...

You can try to convert it to binary features, for each such nominal attribute, e.g. has_A, has_B, has_C. Then if you scale it i1 and i3 will be closer as the mean for that attribute will be above 0.5 (re to your example) - i2 will stand out more. If...

There is already some work on using clustering techniques for geo data. Check this paper, which explains how to use k-means and density-based clustering on geo data: http://paginas.fe.up.pt/~prodei/dsie12/papers/paper_13.pdf An important step is calculating the Euclidean distance in 3D space (lat, long, timestamp). I hope this paper helps you understand. Please go...

python-2.7,machine-learning,cluster-analysis,data-mining,k-means

Interpreted Python code is slow. Really slow. That is why the good python toolkits contain plenty of Cython code and even C and Fortran code (e.g. matrix operations in numpy), and only use Python for driving the overall process. You may be able to speed up your code substantially if...

python,gis,geospatial,latitude-longitude,k-means

Have you tried kmeans? The issue raised in the linked question seems to be with points that are close to 180 degrees. If your points are all close enough together (in the same city or country, for example) then kmeans might work OK for you.

There are several things going on here. First, you should be aware that kmeans(dist,n) uses an algorithm that defines n cluster centroids at random and then moves them around until its minimization criteria are met. This often leads to a local minimum, which in turn means that if you run...

cluster-analysis,data-mining,k-means

Here is a tutorial on how to perform such a k-means modification: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/SameSizeKMeans It's not exactly what you need, but a closer k-means variant that can be easily adapted to your needs. Plus, it is a walkthrough tutorial....

python,machine-learning,cluster-analysis,k-means

As per my knowledge clustering becomes very memory intensive as the size increases, you will have to figure out a way to reduce the dimensionality of your data. I am not familiar with ROCK but I've worked on clustering problems before wherein I had to cluster millions of documents. Distance...

OpenCV weirdly interprets one-dimensional data as a 1 element array. Something like following should fix the behavior: kmeans(cv::Mat(values).reshape(1, values.size()), K, labels, TermCriteria(TermCriteria::COUNT + TermCriteria::EPS, 10, 1.0), 10, KMEANS_PP_CENTERS, centers); ...

machine-learning,cluster-analysis,k-means,unsupervised-learning

The k-means algorithm requires some initialization of the centroid positions. For most algorithms, these centroids are randomly initialized with some method such as the Forgy method or random partitioning, which means that repeated iterations of the algorithm can converge to vastly different results. Remember that k-means is iterative, and at...

machine-learning,cluster-analysis,k-means,hierarchical-clustering

The idea I'm suggesting is originated in text-processing, NLP and information retrieval and very widely used in situations where you have sequences of characters/information like Genetic information. Because you have to preserve the sequence, we can use the concepts of n-grams. I'm using bi-grams in the following example, though you...
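
A minimal sketch of the bi-gram idea: split each sequence into overlapping pairs, which keeps local order, and then compare the resulting sets. Comparing them with the Jaccard index is an assumption made for this example, as are the function names and the two toy sequences.

```python
def ngrams(sequence, n=2):
    """All overlapping n-grams of a sequence, in order (bi-grams by default)."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]

def jaccard(a, b):
    """Size of the intersection over size of the union of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

s1 = "ACGTAC"
s2 = "ACGTTC"
similarity = jaccard(ngrams(s1), ngrams(s2))
print(ngrams(s1), similarity)
```

Larger n preserves more of the ordering but makes exact matches rarer, so the right n depends on how long and how noisy the sequences are.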

cluster-analysis,data-mining,k-means,elki

5 seconds is probably too fast to get you a reliable measurement. Furthermore, with a larger k the result may converge with fewer iterations. You may want to use -time to see the time needed to run the actual algorithm. With your method, parsing and writing will have a non-negligible...

You can use a colormap. Let R1, B1 and G1 be the RGB values you want to display the first cluster in (values in range [0..1]), and R2 be the red channel value for the second cluster and so on... Then your color map is: cmp = [R1 G1 B1;...

Simply choosing the K entities that are farthest apart as initial centroids is rather dangerous. Real-world data sets tend to have outliers, and under your approach these would be chosen as initial centroids. There are many initialization algorithms for K-Means; perhaps you would like to take a look at intelligent K-Means....

python,scipy,scikit-learn,nltk,k-means

You should take a look at how the Kmeans algorithm works. First, the stop words never make it to vectorized, so they are totally ignored by Kmeans and don't have any influence on how the documents are clustered. Now suppose you have: sentence_list=["word1", "word2", "word2 word3"] Let's say you want 2...

The error: Exception in thread "main" weka.core.WekaException: weka.clusterers.SimpleKMeans: Cannot handle any class attribute! states that SimpleKMeans cannot handle a class attribute. This is because K-means is an unsupervised learning algorithm, meaning that there should be no class defined. Yet, one line in the code sets the class value. If you...

python,performance,numpy,cluster-analysis,k-means

The fastest you'll get with the numpy/scipy stack is a specialized function just for this purpose scipy.spatial.distance.cdist. scipy.spatial.distance.cdist(XA, XB, metric='euclidean', p=2, ...) Computes distance between each pair of the two collections of inputs. It's also worth noting that scipy provides kmeans clustering as well. scipy.cluster.vq.kmeans...

Your main question seems to be how to calculate distances between a data matrix and some set of points ("centers"). For this you can write a function that takes as input a data matrix and your set of points and returns distances for each row (point) in the data matrix...

python,classification,cluster-analysis,data-mining,k-means

There are two related (but distinct technique-wise) questions here; the first relates to the choice of clustering technique for this data. The second, predicate question relates to the data model--ie, for each sentence in the raw data, how to transform it to a data vector suitable for input to a...

machine-learning,scikit-learn,k-means,random-forest

Be careful in turning a list of string values into categorical ints as the model will likely interpret the integers as being numerically significant, but they probably are not. For instance, if: 'Dog'=1,'Cat'=2,'Horse'=3,'Mouse'=4,'Human'=5 Then the distance metric in your clustering algorithm would think that humans are more like mice than...
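
The effect is easy to see with a few lines of plain Python. The sketch below uses the integer codes from the example above and compares them with a one-hot (one binary column per category) encoding; the one-hot helper is written by hand here for illustration.

```python
import math

animals = ["Dog", "Cat", "Horse", "Mouse", "Human"]
int_code = {name: i + 1 for i, name in enumerate(animals)}  # 'Dog'=1 ... 'Human'=5

def one_hot(name):
    """One binary column per category, exactly one of them set to 1."""
    return [1.0 if name == a else 0.0 for a in animals]

# With integer codes, the distance depends on the arbitrary numbering.
int_gap_human_mouse = abs(int_code["Human"] - int_code["Mouse"])
int_gap_human_dog = abs(int_code["Human"] - int_code["Dog"])

# With one-hot vectors, every pair of distinct categories is equally far apart.
hot_gap_human_mouse = math.dist(one_hot("Human"), one_hot("Mouse"))
hot_gap_human_dog = math.dist(one_hot("Human"), one_hot("Dog"))
print(int_gap_human_mouse, int_gap_human_dog, hot_gap_human_mouse, hot_gap_human_dog)
```

One-hot encoding removes the artificial ordering, at the cost of one extra column per category.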

matlab,image-processing,machine-learning,octave,k-means

edge in MATLAB / Octave returns a binary / logical matrix. kmeans requires that the input be a double or single matrix. Therefore, simply cast ed to double and continue: ed=edge(de,"canny"); imshow(ed); ed = double(ed); %// Change j=kmeans(ed,3); ...

You can use ddply from plyr to do this easily. library(plyr) ddply(df,.(cluster),summarise,variance1 = var(V1),variance2 = var(V2),mean1 = mean(V1),...) You can also do it this way, ddply(df,.(cluster),function(x){ res = c(as.numeric(colwise(var)(x)),as.numeric(colwise(mean)(x))) names(res) = paste0(rep(c('Var','Mean'),each = 4),rep(1:4,2)) res }) ...

You will need to instantiate the input fields used by the k-means model. You do this by adding a 'Type' node before the modeling node and after any field operation node that would compute or change any of the nodes that are used as input to the model. In...

c++,opencv,cluster-analysis,k-means,feature-extraction

The result should always be [n x 15] if you have n images and k=15. But in the first run, you looked at the vocabulary, not the feature representation of the first images. The 128 you are seeing there is the SIFT dimensionality; these are 15 "typical" SIFT vectors; they...