xml,xslt,pattern-matching,string-matching,similarity

Assuming XSLT 2.0, the following stylesheet takes the first XML document as its input and the second document's URL as a parameter. For each element name in the first document, it outputs a list of names from the second document that contain that name or are contained in it: <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema"...

python,pandas,distance,similarity,categorical-data

If you're concerned about or reliant on categoricals, another approach is to define your categories in an ordered list, use this to create a dict that maps the order to the categories, and pass this dict to map: In [560]: df = pd.DataFrame({'id' : range(1,9), 'rate' : ['bad',...

Using C#'s LINQ ... Say you have an array of doubles named A and another named B. This will give you the Jaccard index: var CommonNumbers = from a in A.AsEnumerable<double>() join b in B.AsEnumerable<double>() on a equals b select a; double JaccardIndex = (((double) CommonNumbers.Count()) / ((double) (A.Count() +...
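
The same idea can be sketched in Python. This is a minimal illustration, not a port of the LINQ query above: it assumes both inputs are treated as sets (duplicates are collapsed), and the names `jaccard_index`, `A` and `B` are made up for the example.

```python
def jaccard_index(a, b):
    """Jaccard index of two collections, treated as sets: |A & B| / |A | B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


A = [1.0, 2.0, 3.0, 4.0]
B = [3.0, 4.0, 5.0]
# Shared values: {3.0, 4.0}; all distinct values: {1.0, 2.0, 3.0, 4.0, 5.0}
print(jaccard_index(A, B))  # 0.4
```

Comparing doubles for exact equality is fragile; if the values come from computations, rounding them before building the sets may be needed.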

sql,postgresql,pattern-matching,similarity

I am assuming you are the only one writing to the tables, so no concurrency conflicts ... Add the ID column to table2, can be NULL for now: ALTER TABLE table2 ADD COLUMN citizenid int; -- or whatever the type is Consider an additional flag on table1 to take row...

java,lucene,string-matching,similarity

They're all good. They work on different properties of strings and have different matching properties. What works best for you depends on what you need. I'm using the JaccardSimilarity to match names. I chose the JaccardSimilarity because it was reasonably fast and for short strings excelled in matching names with...

This defines a mismatch score between each of the two elements of each row of Annotation and applies it giving a score to each row: a <- Annotation ch <- replace(a, TRUE, lapply(a, sub, pat = " *$", replace = "")) # rm trailing spaces w <- lapply(ch, strsplit, "...

excel,vba,function,excel-vba,similarity

The Like operator is often confused. It doesn't simply check if two strings match. Instead, the value to the left of the Like operator is the string to test against the pattern, which is to the right of the operator. String_To_Test Like Pattern With that said, an expression such as:...

python,similarity,locality-sensitive-hash

a. It's a left shift: https://docs.python.org/2/reference/expressions.html#shifting-operations It shifts the bits one to the left. b. Note that ^ is not the "to the power of" but "bitwise XOR" in Python. c. As the comment states: it defines "number of bits per signature" as 2**10 → 1024 d. The lines calculate...
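
A quick way to check points a to c yourself is to evaluate the operators in an interpreter. The variable names below are only for illustration:

```python
shifted = 1 << 10   # left shift: the single set bit moves ten places to the left
power = 2 ** 10     # exponentiation ("to the power of")
xored = 2 ^ 10      # bitwise XOR, not a power: 0b0010 ^ 0b1010 == 0b1000

print(shifted, power, xored)  # 1024 1024 8
```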

php,arrays,algorithm,similarity

You don't really need 2 arrays. Instead of using the foreach loop without $key, you can use it with $key and skip the solver when the $key is the same. Then you also avoid dupes. foreach ($arrayOne as $key => $rowOne) { foreach($arrayOne as $ikey => $rowTwo){ if ($ikey !=...

sql,ms-access,detect,similarity

Say we have sample data in a table named [myData]:

FID  PC1  PC2  PC3  PC4
---  ---  ---  ---  ---
1    1    3    5    2
2    4    4    4    0
3    5    3    1    1
4    9    9    8    7

We use some formula to give each row a "rating"...

r,data.frame,compare,similarity

I guess your question is how to find all the rows in df2 that are not in df1, or all the rows in df2 that are in df1. If that is what you mean, you can use the sqldf library: library(sqldf) df2NotIndf1 <- sqldf('SELECT * FROM df2 EXCEPT SELECT * FROM df1') df2Indf1 <- sqldf('SELECT...

You have a few problems here. You can't use column aliases in the where clause, you are trying to assign a column alias on the wrong side of the parenthesis, you can't feed a set to the second argument of similarity, and you've just generally mangled the syntax in several...

Take a look at jsdifflib, a JavaScript implementation of Python's SequenceMatcher. You can get the similarity percentage: difflib.ratio(string1, string2) * 100. Here is the demo. Hope this is what you want.
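
For reference, this is roughly what the same calculation looks like with Python's own `difflib`, which jsdifflib is modelled on. The helper name `similarity_percent` is made up for this sketch:

```python
from difflib import SequenceMatcher


def similarity_percent(s1, s2):
    """Similarity of two strings as a percentage, based on SequenceMatcher.ratio()."""
    # The first argument is the "junk" predicate; None disables junk filtering.
    return SequenceMatcher(None, s1, s2).ratio() * 100


print(similarity_percent("abcd", "bcde"))  # 75.0
```

`ratio()` returns 2*M/T, where M is the number of matching characters and T is the total length of both strings, so "abcd" against "bcde" gives 2*3/8.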

opencv,similarity,template-matching

Perhaps I should apologize first: the value of the matrix is indeed zero after normalization, as long as the two pictures are of the same size. I was wrong about that:) Check out this page: OpenCV - Normalize Part of the OpenCV source code: void cv::normalize( InputArray _src, OutputArray _dst,...

Compare echo levenshtein("Femenino", "FEMENINO"); // 7 VS echo levenshtein(strtolower("Femenino"), strtolower("FEMENINO")); // 0 If letter case doesn't matter for your application, convert both strings to the same case before you compare and you'll get a significant improvement....
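
The effect is not specific to PHP. Below is a small Python sketch of the same comparison; `levenshtein` here is a plain dynamic-programming implementation written for this example, not a library call:

```python
def levenshtein(a, b):
    """Levenshtein distance (insertions, deletions, substitutions, each cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("Femenino", "FEMENINO"))                  # 7
print(levenshtein("Femenino".lower(), "FEMENINO".lower()))  # 0
```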

distance,shape,similarity,euclidean-distance

What distance function to use depends on your concrete task. I guess cosine similarity may be what you want....
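
If cosine similarity does fit the task, it is short enough to write by hand. A minimal sketch, assuming the shapes are already encoded as equal-length numeric vectors (how to encode them is not covered here):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2, 3], [2, 4, 6]))  # same direction: 1.0 up to rounding
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal: 0.0
```

Note that cosine similarity ignores magnitude: a vector and a scaled copy of it are considered identical, which may or may not be what you want.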

I'm not aware of any packages, but this implements your own metric (from your comment): siml <- function(x,y) { length(intersect(lst[[x]],lst[[y]]))/length(union(lst[[x]],lst[[y]])) } z <- expand.grid(x=1:length(lst),y=1:length(lst)) result <- mapply(siml,z$x,z$y) dim(result) <- c(length(lst),length(lst)) result # [,1] [,2] [,3] [,4] [,5] [,6] # [1,] 1.000 0.8 0.667 0.667 0.8 0.667 # [2,] 0.800 1.0...

machine-learning,svm,similarity

Yes, they can - you are describing a new classification problem. Your input is simply now twice as large as before (the two feature vectors concatenated together) and the class labels are "same" and "not same". ie: your feature vectors may have been [a, b] and [x, y] for two...
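
As a rough sketch of how such a training set could be built: every pair of samples becomes one concatenated input, labelled "same" or "not same". The function name `make_pair_dataset` and the toy data are hypothetical, and the resulting `X` and `y` would then be handed to whatever classifier you use.

```python
from itertools import combinations


def make_pair_dataset(samples):
    """Turn (feature_vector, class_label) samples into a pairwise same/not-same dataset."""
    X, y = [], []
    for (feat_1, label_1), (feat_2, label_2) in combinations(samples, 2):
        X.append(list(feat_1) + list(feat_2))  # concatenated input, twice as long
        y.append("same" if label_1 == label_2 else "not same")
    return X, y


samples = [([1.0, 2.0], "cat"), ([1.1, 2.1], "cat"), ([5.0, 6.0], "dog")]
X, y = make_pair_dataset(samples)
print(X[0], y[0])  # [1.0, 2.0, 1.1, 2.1] same
print(y)           # ['same', 'not same', 'not same']
```

One thing to keep in mind with this construction: the concatenation is order-sensitive, so you may want to add each pair in both orders.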

machine-learning,cluster-analysis,similarity,unsupervised-learning

First, compute your profiles as you already did. Then the crucial step will be some kind of normalization. You can either divide the numbers by their total so that the numbers sum up to 1, or you can divide them by their Euclidean norm so that they have Euclidean norm...
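
The two normalizations mentioned above can be sketched like this. It assumes a profile is a plain list of non-negative counts; the function names are made up for the example:

```python
import math


def normalize_sum(profile):
    """Scale the profile so that its entries sum to 1."""
    total = sum(profile)
    if total == 0:
        raise ValueError("cannot normalize an all-zero profile")
    return [x / total for x in profile]


def normalize_euclidean(profile):
    """Scale the profile so that its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in profile))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero profile")
    return [x / norm for x in profile]


profile = [3.0, 4.0]
print(normalize_sum(profile))        # entries 3/7 and 4/7
print(normalize_euclidean(profile))  # entries 3/5 and 4/5
```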

r,pattern-matching,similarity,r-factor

You can do in some steps using regular expressions: dat$Freq <- as.numeric(dat$Freq) dat$Com[grep('.*(C).*(C[++]).*',dat$Com)] <- 'ccplusplus' dat$Com[grep('C[++]',dat$Com)] <- 'cplusplus' dat$Com[grep('C',dat$Com)] <- 'c' tapply(dat$Freq,dat$Com,sum) # c ccplusplus cplusplus # 11 15 6 ...

elasticsearch,lucene,similarity

Lucene uses the 'practical score function' mentioned in your link, which is an approximation of the cosine similarity - extended to support 'practical' features such as boosts. If you take the vector space cosine similarity formula for a query q and a document d, you have: s(q, d) = q...

machine-learning,fft,data-mining,similarity,feature-extraction

Generally you should extract just a small number of "features" out of the complete FFT spectrum. First: use the log power spectrum. Complex numbers and phase are useless in these circumstances, because they depend on where you start/stop your data acquisition (among many other things). Second: you will see a...

mysql,sql,sorting,order,similarity

You want to sort hex codes by wavelength, which roughly maps onto the hue value. Given a hex code as a six-character string, RRGGBB, you just need a function that takes in the hex code string and outputs the hue value. Here's the formula from this Math.SO answer: R' =...
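
To illustrate the hex-to-hue step outside the database, here is a small Python sketch. It relies on the standard library's `colorsys.rgb_to_hsv` rather than spelling out the formula, and `hex_to_hue` is a name invented for this example; in MySQL you would still need to express the same calculation as a stored function or an expression.

```python
import colorsys


def hex_to_hue(hex_code):
    """Hue in degrees (0 to 360) for a six-character RRGGBB hex code."""
    hex_code = hex_code.lstrip("#")
    # Split into the R, G and B byte pairs and scale each to the range 0..1.
    r, g, b = (int(hex_code[i:i + 2], 16) / 255.0 for i in (0, 2, 4))
    hue, _saturation, _value = colorsys.rgb_to_hsv(r, g, b)
    return hue * 360.0


colors = ["0000FF", "FF0000", "00FF00"]
print(sorted(colors, key=hex_to_hue))  # ['FF0000', '00FF00', '0000FF']
```

Greys have no meaningful hue (they come out as 0), so they will sort together with the reds unless you handle them separately.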

You can always use unset( $counts[$key_two] ) ; ...

You have to find the maximum possible distance between a string of length n and a string of length m. If for example this maximum distance is n + m then the percentage will be 100 - 100 * edit_distance(a, b) / (a.length + b.length) If for example you use...
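
The formula above can be written as a small helper that takes the distance and the maximum possible distance as inputs, so it works with whichever edit distance you use. The function name is hypothetical, and the example plugs in the n + m bound from the text together with the well-known Levenshtein distance of 3 between "kitten" and "sitting":

```python
def similarity_percent(distance, max_distance):
    """Convert an edit distance into a similarity percentage."""
    if max_distance == 0:
        # Both strings are empty, so treat them as identical.
        return 100.0
    return 100.0 - 100.0 * distance / max_distance


a, b = "kitten", "sitting"
edit_distance = 3  # Levenshtein distance between a and b
print(similarity_percent(edit_distance, len(a) + len(b)))  # 100 - 300/13, about 76.9
```

Which bound is correct depends on the operations your distance allows, so pick `max_distance` to match the metric.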

I would recommend looking into the networkx package. It allows you to create directed graphs such as you are describing. For measuring the similarity of 2 routes, I would recommend the Jaccard similarity index. Here is code showing the example you illustrated. First, import a few libraries: graphs, plotting, and...
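
Independently of networkx, the Jaccard comparison itself only needs sets. The sketch below is a plain-Python illustration and makes one assumption you may want to change: two routes are compared by their directed edges (consecutive node pairs) rather than by the nodes they visit. The route data is invented.

```python
def route_edges(route):
    """Set of directed edges (consecutive node pairs) along a route."""
    return set(zip(route, route[1:]))


def route_jaccard(route_1, route_2):
    """Jaccard similarity of two routes, based on their directed edges."""
    edges_1, edges_2 = route_edges(route_1), route_edges(route_2)
    union = edges_1 | edges_2
    if not union:
        return 1.0  # neither route has an edge; treated as identical here
    return len(edges_1 & edges_2) / len(union)


route_a = ["A", "B", "C", "D"]
route_b = ["A", "B", "E", "D"]
# Shared edge: A->B; distinct edges overall: A->B, B->C, C->D, B->E, E->D
print(route_jaccard(route_a, route_b))  # 0.2
```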

java,similarity,recommendation-engine,mahout-recommender,collaborative-filtering

First, in both methods (user-to-user and item-to-item) we have defined two functions: similarity and predict. The similarity function measures how close two entities (users or items) are to each other. In your case, Tanimoto was chosen. What you are missing is the predict function. As you have got...

c#,return-value,similarity,track

Try using a tuple to store the line (or line number) in the list alongside its score. var d = new List<Tuple<string, double>>(); . . . while ((line = file.ReadLine()) != null) { d.Add(Tuple.Create(line, p.calculate_CS(line, document))); } . . . foreach (var item in d.OrderByDescending(t => t.Item2)) { fileW.WriteLine("{0} from line {1}", item.Item2, item.Item1);...

So the following is the answer from akrun: first change the blank cells to NAs with is.na(my.matrix) <- my.matrix=='' and then ignore the NAs when computing match.counts: similarity.matrix <- matrix(nrow=ncol(my.matrix), ncol=ncol(my.matrix)) for(col in 1:ncol(my.matrix)){ matches <- my.matrix[,col] == my.matrix match.counts <- colSums(matches, na.rm=TRUE) match.counts[col] <- 0 similarity.matrix[,col] <- match.counts }...

machine-learning,mahout,similarity,recommendation-engine,mahout-recommender

I'm not familiar with all of them, but I can help with some. Cooccurrence is how often two items occur with the same user. http://en.wikipedia.org/wiki/Co-occurrence Log-Likelihood is the log of the probability that the item will be recommended given the characteristics you are recommending on. http://en.wikipedia.org/wiki/Log-likelihood Not sure about tanimoto...

r,data.frame,similarity,manipulation

You can use dist: dist(t(cbind(unlist(DF1), unlist(DF2))), "binary") # 0.2857143 The distance would be 1 for DF2 <- as.data.frame(xor(DF1, 1) +0L) and 0 for DF2 <- DF1. ...