postgresql,levenshtein-distance

There is a significant conceptual error with your question; GROUP BY takes certain equivalence relations (in the mathematical sense) as an argument and uses that to partition a SQL relation into equivalence classes. The problem is that the relation you describe, namely, "are two strings within a certain edit distance...
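To make the point concrete, here is a minimal sketch showing that "within edit distance k" is not transitive, and so is not an equivalence relation. The helper names `levenshtein` and `within`, the sample words, and the threshold of 1 are illustrative assumptions, not anything from the original question.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        prev = curr
    return prev[-1]


def within(a: str, b: str, k: int = 1) -> bool:
    """Hypothetical relation: are a and b at most k edits apart?"""
    return levenshtein(a, b) <= k


# "cat" ~ "cot" and "cot" ~ "dot", yet "cat" and "dot" are 2 edits apart.
print(within("cat", "cot"), within("cot", "dot"), within("cat", "dot"))
```

Because the relation holds for the first two pairs but not the third, there is no consistent way to partition these three strings into groups.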

levenshtein-distance,finite-automata

Can someone explain to me the idea and the basic functionality of a Levenshtein automaton in easy words? A deterministic finite automaton (DFA) is an alphabet (set of possible input characters), a set of states (just abstract objects with no special properties), a transition function (given any state and an...

You could adapt the algorithm from Wikipedia as follows: size_t levenshtein_distance(const std::string &s1, const std::string &s2) { const size_t substitution = 5; const size_t deletion = 2; const size_t insertion = 3; const size_t len1 = s1.size(), len2 = s2.size(); vector<vector<unsigned int> > d(len1 + 1, vector<unsigned int>(len2 + 1));...
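Since the C++ snippet above is cut off, here is a rough Python sketch of the same idea: the standard dynamic-programming table with per-operation costs. The cost values (5, 2, 3) come from the snippet; the function name and the way the first row and column are filled are my assumptions about how the rest would go.

```python
def weighted_levenshtein(s1, s2, substitution=5, deletion=2, insertion=3):
    """Edit distance where each operation type has its own cost."""
    len1, len2 = len(s1), len(s2)
    d = [[0] * (len2 + 1) for _ in range(len1 + 1)]

    # Turning a prefix of s1 into "" takes only deletions,
    # turning "" into a prefix of s2 takes only insertions.
    for i in range(1, len1 + 1):
        d[i][0] = i * deletion
    for j in range(1, len2 + 1):
        d[0][j] = j * insertion

    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            sub = 0 if s1[i - 1] == s2[j - 1] else substitution
            d[i][j] = min(
                d[i - 1][j] + deletion,
                d[i][j - 1] + insertion,
                d[i - 1][j - 1] + sub,
            )
    return d[len1][len2]


print(weighted_levenshtein("ab", "b"))  # one deletion
```

Note that with these particular costs a substitution (5) is never cheaper than a deletion plus an insertion (2 + 3), so the choice of weights affects which alignment wins.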

r,matching,string-matching,levenshtein-distance,fuzzy-search

You can merge these data frames: z <- merge(df1,df2,by='release_date',suffixes=c('.df1','.df2')) which will give you a Cartesian product (i.e. all possible combinations between df1 and df2 for the same release_date), and then calculate the Levenshtein similarity by: z$L.dist <- levenshteinSim(z$title.df1,z$title.df2) Having z$L.dist, you can filter the desired rows: subset(z,L.dist > 0.85) Update...
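The same join-then-filter idea can be sketched without R. The version below is a plain-Python illustration under some assumptions: the rows are made-up `(release_date, title)` tuples, and `similarity` is defined as 1 - distance / longer length, which is my reading of what a Levenshtein similarity score means here rather than a guaranteed match for the R function.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for nothing shared."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuzzy_join(left, right, threshold=0.85):
    """Pair rows with the same release date whose titles are similar enough."""
    matches = []
    for (date1, title1), (date2, title2) in product(left, right):
        if date1 == date2 and similarity(title1, title2) > threshold:
            matches.append((date1, title1, title2))
    return matches


# Hypothetical sample data standing in for df1 and df2.
df1 = [("2001-05-25", "Pearl Harbor"), ("2001-05-25", "Shrek")]
df2 = [("2001-05-25", "Pearl Harbour"), ("2001-05-25", "Shrek 2"),
       ("2004-05-19", "Pearl Harbor")]

print(fuzzy_join(df1, df2))
```

Only pairs sharing a release date are compared, so the identical title with a different date is never matched.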

python,algorithm,levenshtein-distance

You can use this module - there is an editops function there, which returns a list with the operations needed to transform one string to another. Example: Levenshtein.editops("FBBDE", "BCDASD") [('delete', 0, 0), ('replace', 2, 1), ('insert', 4, 3), ('insert', 4, 4), ('replace', 4, 5)] From the docs: Find sequence of...

perl,sorting,levenshtein-distance

Custom sorts take two inputs, perform a 'comparison' and respond with -1, 0 or 1 depending on whether the first is before, equal to or after the second. Sorting is designed for making a positional order, not really for 'grouping stuff that's vaguely similar'. You do have the Text::Levenshtein module which quickly lets you...

algorithm,cluster-analysis,spell-checking,levenshtein-distance

Do not use clustering. It will produce a lot of false positives. It will consider “Sam” and “Pam” highly similar. Instead look at spelling correction, or define a Levenshtein distance threshold. But something that considers typo behavior will work even better than such a naive letter approach ....

python,string,unicode,levenshtein-distance,edit-distance

According to its documentation, it supports unicode: It supports both normal and Unicode strings, but can't mix them, all arguments to a function (method) have to be of the same type (or its subclasses). You need to make sure the Chinese characters are in unicode though: In [1]: from Levenshtein...

r,string-matching,levenshtein-distance

See ?stringdistmatrix. Specifically, read what it says about the weights argument. You'll see: Weights must be positive and not exceed 1. In this case, the error is telling you that one weight is not positive. You have one of the weights set to 0. Once you fix that, you'll still...

php,arrays,levenshtein-distance

Use usort() (http://us2.php.net/manual/en/function.usort.php), and have your compare function use the Levenshtein distance. If that value is expensive to compute, then pre-compute it and store it in a separate array so that your compare function doesn't have to recompute it frequently during the sort....
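The pre-compute-then-sort approach looks roughly like this when sketched in Python instead of PHP. The function name `sort_by_distance`, the word list and the query are illustrative assumptions; the point is only that each distance is computed once and the sort then reads from the lookup table.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sort_by_distance(words, query):
    """Order words by edit distance to query, computing each distance once."""
    distances = {word: levenshtein(word, query) for word in words}
    # The sort only looks values up; it never recomputes a distance.
    return sorted(words, key=distances.__getitem__)


print(sort_by_distance(["banana", "maple", "apple", "ample"], "apple"))
```

A comparison callback would call the distance function on every comparison, which is why caching the values first pays off when the distance is expensive.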

python,grouping,levenshtein-distance

You are removing items from the list you are iterating over. That will change the indexes and cause unexpected behavior. You should copy the values you need to a new list rather than modifying the list you are iterating over.
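As a sketch of the "build a new list" approach: the question's code is not shown here, so the grouping rule below (compare each word with the first member of each existing group) and the names `group_similar` and `max_distance` are assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def group_similar(words, max_distance=1):
    """Collect words into new lists; the input list is never modified."""
    groups = []
    for word in words:
        for group in groups:
            # Compare against the group's first member only (an assumption).
            if levenshtein(word, group[0]) <= max_distance:
                group.append(word)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([word])
    return groups


words = ["cat", "cot", "dog", "dot", "cart"]
print(group_similar(words))
```

Because nothing is removed from `words` during the loop, every element is visited exactly once and the indexes stay stable.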

php,search,levenshtein-distance

I ended up using a regular expression to find and replace any whitespace with a " " in the string so it would tokenize properly. $pregTokens = $pregText = preg_replace('/\s+/', ' ', $primaryResult['ent_posts_content']); $postTokens = explode(' ', $pregTokens); ...

php,mysql,full-text-search,levenshtein-distance

In general, the correct answer to "Which is faster" is "Try it on your system and your data". In this case, that is the best answer. I do not know off-hand which system (MySQL or PHP) has a faster implementation of Levenshtein. Probably nobody does, because this depends on system...

The function adist computes a generalized Levenshtein distance. Is that what you need? Assuming you have your data in a data.frame, using: adist(mydf$Str) will return a matrix with the distances between each pair of the Str column....

search,indexing,solr,levenshtein-distance

I wonder what keeps you from trying it with Solr, as Solr provides much of what you need. You can declare the field as type="string" multiValued="true" and save each list item as a value. Then, when querying, you specify each of the items in the list to look for as...

php,sorting,levenshtein-distance

See the answer to this SO question; they have a similar need but have structured their data in a way that makes the answer easier. It looks like PHP supports sorting by multiple attributes (in descending priority) as long as those attributes are built into the associative array that's being...

java,string,levenshtein-distance

I am not sure if there is a Java implementation for it, but you can find the implementation for your algorithm here: http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Java Good luck :)...

performance,matlab,levenshtein-distance

The function 'strdist' is not a built-in MATLAB function, so I guess you took it from the File Exchange. That's also why both your approaches are roughly equal in time: cellfun internally just expands into a loop. If strdist is symmetric, i.e. strdist(a,b)==strdist(b,a), you can however save half the computations. This...
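The symmetry trick can be sketched as follows, in Python rather than MATLAB and with a plain Levenshtein function standing in for strdist (an assumption, since strdist's exact behaviour is not shown). Only pairs with i < j are computed and each result is mirrored.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def distance_matrix(strings):
    """Full pairwise matrix built from the upper triangle only."""
    n = len(strings)
    matrix = [[0] * n for _ in range(n)]  # the diagonal stays 0
    for i in range(n):
        for j in range(i + 1, n):
            # One computation fills both (i, j) and (j, i).
            matrix[i][j] = matrix[j][i] = levenshtein(strings[i], strings[j])
    return matrix


for row in distance_matrix(["kitten", "sitting", "mitten"]):
    print(row)
```

For n strings this makes n(n-1)/2 distance calls instead of n².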

python,r,levenshtein-distance,edit-distance,eye-tracking

You need a version of the Wagner-Fischer algorithm that uses non-unit cost in its inner loop. I.e. where the usual algorithm has +1, use +del_cost(a[i]), etc. and define del_cost, ins_cost and sub_cost as functions taking one or two symbols (probably just table lookups).
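A minimal sketch of that variant, assuming the cost functions are passed in as callables. The names del_cost, ins_cost and sub_cost follow the text above; the function name `weighted_edit_distance` and the example cost table are made up for illustration.

```python
def weighted_edit_distance(a, b, del_cost, ins_cost, sub_cost):
    """Wagner-Fischer table where every +1 is replaced by a cost lookup."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]

    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1])
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1])

    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(a[i - 1]),
                d[i][j - 1] + ins_cost(b[j - 1]),
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
    return d[-1][-1]


# Hypothetical cost table: swapping "a" and "b" is cheap, everything else costs 1.
CHEAP_SWAPS = {("a", "b"): 0.5, ("b", "a"): 0.5}


def unit(_symbol):
    return 1.0


def sub(x, y):
    return 0.0 if x == y else CHEAP_SWAPS.get((x, y), 1.0)


print(weighted_edit_distance("a", "b", unit, unit, sub))
```

With all three functions returning 1 (and 0 for matching symbols) this reduces to the ordinary Levenshtein distance.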

dynamic-programming,levenshtein-distance,edit-distance

If I understand you correctly, I think the answer is yes: the Levenshtein edit distance could be different from the result of an algorithm that only allows deletions and substitutions to the larger string. Because of this, you would need to modify the algorithm, or create a different one, to get your limited version. Consider...

c++,string,levenshtein-distance

You can use the algorithm to determine the edit distance between the two strings and if that distance is less than a certain threshold you would consider it a match. The trick would be determining the threshold. I have not played with this much, but one option that comes to...

string,r,vectorization,levenshtein-distance,stringr

stringdist has a version for computing all the distances in a matrix, so I think that something like this will be an improvement. It's about four times as quick on my computer when run with the 100 reps line included: d <- c("herb", "market", "merchandise", "fun", "casket93", "old", "herbb", "basket",...

In Machine Learning, a cost function is the function that you're trying to minimize to achieve the best result. When the machine executes, for instance, steps A, B and C, the cost function will calculate how much it 'cost' to execute those steps. The term cost is connected to a...