stemming words in python

python,nlp,stemming

For each suffix in the given list you can check whether the given word ends with that suffix; if it does, remove the suffix, otherwise return the word unchanged:

```python
suffixes = ['ing']

def stem(word):
    for suff in suffixes:
        if word.endswith(suff):
            return word[:-len(suff)]
    return word

print(stem('having'))  # hav
```
...
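With several suffixes in the list, the order of the checks starts to matter once suffixes overlap. A minimal sketch (the suffix list and the example words are my own, not from the answer) tries the longest suffix first:

```python
# Sketch: with several suffixes, trying the longest first keeps 'es' from
# losing to plain 's' (the list and the words are illustrative only).
suffixes = ['ing', 'es', 's']

def stem(word):
    for suff in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suff):
            return word[:-len(suff)]
    return word

print(stem('boxes'))    # box
print(stem('walking'))  # walk
```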

StanfordCoreNLP does not work in my way

java,nlp,stanford-nlp,stemming,lemmatization

Lemmatization should ideally return a canonical form (known as the 'lemma' or 'headword') of a group of words. This canonical form, however, is not always what we intuitively expect. For example, you expect "learning" to yield the lemma "learn". But the noun "learning" has the lemma "learning", while only the...

Are Snowball & SnowballC packages different in R?

r,stemming,tm,snowball

You have an older version of pkg:tm. The current version of tm has a DESCRIPTION file that lists SnowballC as a "Suggests". Older versions suggested Snowball.

Package: tm
Title: Text Mining Package
Version: 0.5-10
Date: 2014-01-07
Authors@R: c(person("Ingo", "Feinerer", role = c("aut", "cre"), email = "[email protected]"), person("Kurt", "Hornik", role =...

Terms get truncated after indexing document (Elasticsearch)

elasticsearch,stemming

This is the effect of the stemmer. By default the snowball stemmer is also used as the analyzer. The expected behaviour of a stemmer is to convert words to their base form, like below:

Jumping => jump
Running => run

And so on. The snowball stemmer works on an algorithm...
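The "Jumping => jump" behaviour can be mimicked in a few lines of Python. This is a toy illustration of suffix stripping only, NOT the real Snowball algorithm:

```python
# Toy illustration, not the actual Snowball algorithm: strip '-ing' and
# undouble a trailing consonant so 'running' becomes 'run', not 'runn'.
def toy_stem(word):
    if word.endswith('ing'):
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in 'aeiou':
            word = word[:-1]
    return word

print(toy_stem('jumping'))  # jump
print(toy_stem('running'))  # run
```

The real stemmer applies many more rules (and handles far more suffixes), which is exactly why indexed terms can look truncated.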

Greek words stemming Lucene

lucene,stemming

You don't really need to do anything with GreekStemFilter directly. Just use GreekAnalyzer. It includes the stemmer.

Why cowardly becomes cowardli after stemming?

nlp,nltk,stemming

First let's look at the different stemmers/lemmatizers that NLTK has:

>>> from nltk import stem
>>> lancaster = stem.lancaster.LancasterStemmer()
>>> porter = stem.porter.PorterStemmer()
>>> snowball = stem.snowball.EnglishStemmer()
>>> wnl = stem.wordnet.WordNetLemmatizer()
>>> word = "cowardly"
>>> lancaster.stem(word)
'coward'
>>> porter.stem(word)
u'cowardli'
>>> snowball.stem(word)
u'coward'
>>> wnl.stem(word)
Traceback (most recent call...

Is it possible to get a natural word after it has been stemmed?

nlp,stemming,porter-stemmer

A stemmer is able to process artificial, non-existent words. Would you like them to be returned as elements of a set of all possible words? How do you know that a word doesn't exist and shouldn't be returned? As an option: find a dictionary of all words and their forms. Find...
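The dictionary option suggested here amounts to building a reverse index from stems to the real words that produce them. A sketch (the three-line stemmer and the word list are my own, purely illustrative):

```python
from collections import defaultdict

# Toy stemmer and word list for illustration; a real setup would use the
# same stemmer as the index plus a full word-form dictionary.
def stem(word):
    return word[:-3] if word.endswith('ing') else word

words = ['have', 'having', 'run', 'running', 'stemming']
unstem = defaultdict(set)
for w in words:
    unstem[stem(w)].add(w)

print(sorted(unstem['hav']))   # ['having']
print(sorted(unstem['runn']))  # ['running']
```

Looking up a stem then returns every known surface form, which sidesteps the "stemmer output may not be a word" problem.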

How to use/call stemmer (croatian stemmer) [closed]

python,call,stemming

The answer is written in your file:

python Croatian_stemmer.py input_file output_file

...

StanfordNLP lemmatization cannot handle -ing words

java,nlp,stanford-nlp,stemming,lemmatization

Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma. In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing...
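The point can be made concrete with a hypothetical lookup table (the table and function below are illustrative, not CoreNLP's API):

```python
# Hypothetical table illustrating the point above: the lemma depends on the
# (token, part-of-speech) pair, not on the token alone.
LEMMAS = {
    ('confusing', 'ADJ'): 'confusing',  # "This is confusing"
    ('confusing', 'VERB'): 'confuse',   # "I was confusing X with Y"
}

def lemmatize(token, pos):
    return LEMMAS.get((token, pos), token)

print(lemmatize('confusing', 'ADJ'))   # confusing
print(lemmatize('confusing', 'VERB'))  # confuse
```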

How to correctly configure solr stemming

solr,stemming

I would also suggest trying a different stemmer. There are 4 included in Solr:

solr.PorterStemFilterFactory
solr.SnowballPorterFilterFactory
solr.KStemFilterFactory
solr.HunspellStemFilterFactory (you will need a dictionary for this one from an external source, like open office)

Each of those produces different results for your problem, see below. Given the results and that...

Stemming text in java [duplicate]

java,lucene,stemming

Make this:

```java
public static String stem(String string) throws IOException {
    TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_47, new StringReader(string));
    tokenizer = new StandardFilter(Version.LUCENE_47, tokenizer);
    tokenizer = new LowerCaseFilter(Version.LUCENE_47, tokenizer);
    tokenizer = new PorterStemFilter(tokenizer);
    CharTermAttribute token = tokenizer.getAttribute(CharTermAttribute.class);
    tokenizer.reset();
    StringBuilder stringBuilder = new StringBuilder();
    while (tokenizer.incrementToken()) {
        if (stringBuilder.length()...
```

Porter Stemming of fried

nlp,nltk,stemming,porter-stemmer

A stem as returned by Porter Stemmer is not necessarily the base form of a verb, or a valid word at all. If you're looking for that, you need to look for a lemmatizer instead.

Why does PorterStemmer in NLTK convert my “string” into u“string”

python,nltk,sax,stemming

As @kindall noted, it's because of the unicode string. But more specifically, it's because NLTK uses "from __future__ import unicode_literals", which converts ALL strings to unicode by default; see https://github.com/nltk/nltk/blob/develop/nltk/stem/porter.py#L87 So let's try an experiment in python 2.x:

$ python
>>> from nltk.stem import PorterStemmer
>>> porter = PorterStemmer()
...

Logical flaw: if List is null return input else print function output

java,null,wordnet,stemming,data-processing

That won't work because the List is not null; the List is empty. You have to do the check like this:

if (stemmed_words.size() > 0)

UnicodeDecodeError unexpected end of data while stemming over dataset

python,unicode,pandas,nltk,stemming

Read up on unicode string processing in python. There is the type str, but there is also a type unicode. I suggest to:

- decode each line immediately after reading, to narrow down incorrect characters in your input data (real data contains errors)
- work with unicode and u" " strings everywhere....
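The "decode each line immediately" advice can be sketched as follows (shown in Python 3 syntax, where unicode is the default str; the byte strings are made-up sample input):

```python
# Sketch: decode each raw line right after reading, inside try/except, so a
# bad byte is reported with its line number instead of failing much later.
raw_lines = [b'good line', b'caf\xc3\xa9', b'truncated \xc3']  # made-up input

for lineno, raw in enumerate(raw_lines, 1):
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError as err:
        print(f'line {lineno}: {err}')  # pinpoints the offending input line
    else:
        print(text)  # work with decoded text from here on
```

The last line reproduces the "unexpected end of data" error from the question title: 0xC3 is a UTF-8 lead byte whose continuation byte is missing.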

Confused about priority between stemmer and pos tagger

python,nltk,stemming,part-of-speech

Why don't you try it out? Here's an example:

>>> from nltk.stem import PorterStemmer
>>> from nltk import word_tokenize, pos_tag
>>> sent = "This is a messed up sentence from the president's Orama and it's going to be sooo good, you're gonna laugh."

This is the outcome of tokenizing.

>>>...

How to split a text into two meaningful words in R

r,string-split,stemming,text-analysis

Given a list of English words you can do this pretty simply by looking up every possible split of the word in the list. I'll use the first Google hit I found for my word list, which contains about 70k lower-case words:

wl <- read.table("http://www-personal.umich.edu/~jlawler/wordlist")$V1
check.word <- function(x, wl) {...
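The same look-up-every-split idea in Python, with a tiny inline word list standing in for the 70k-word file (both the set and the example words are illustrative):

```python
# Tiny stand-in word list; the answer reads ~70k words from a URL instead.
wordlist = {'honey', 'moon', 'note', 'book', 'basket', 'ball'}

def splits(word, wl):
    # Try every split point and keep the ones where both halves are words.
    return [(word[:i], word[i:])
            for i in range(1, len(word))
            if word[:i] in wl and word[i:] in wl]

print(splits('honeymoon', wordlist))   # [('honey', 'moon')]
print(splits('basketball', wordlist))  # [('basket', 'ball')]
```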

Are there any Lucene stemmers that handle Shakespearean English?

solr,lucene,nlp,stemming

It looks like the rule changes are minimal for that. So, it might be possible to copy/modify the PorterStemmer class and related Factories/Filters. Or it might be possible to add those specific rules as a regular-expression filter before Porter....
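The regex pre-filter idea might look like this (the two rules below are hypothetical examples for archaic verb endings, not a complete rule set):

```python
import re

# Hypothetical pre-filter for two archaic endings, run before a standard
# Porter stemmer as suggested above; the rules are illustrative only.
RULES = [
    (re.compile(r'est$'), ''),    # knowest -> know
    (re.compile(r'eth$'), 'e'),   # giveth  -> give
]

def modernize(word):
    for pattern, repl in RULES:
        if pattern.search(word):
            return pattern.sub(repl, word)
    return word

print(modernize('knowest'))  # know
print(modernize('giveth'))   # give
```

After this normalization pass, the unmodified Porter rules would apply to the modernized forms.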

Difference in handling possessive (apostrophes) with english stemmer between 1.2 and 1.4

elasticsearch,lucene,stemming,porter-stemmer

I think this is the change, introduced in 1.3.0. The StemmerTokenFilter had a number of issues:

- english returned the slow snowball English stemmer
- porter2 returned the snowball Porter stemmer (v1)

Changes:

- english now returns the fast PorterStemmer (for indices created from v1.3.0 onwards)
- porter2 now returns the snowball English stemmer...

Default english stemming in SOLR

solr,stemming

Your query doesn't look right; try querying it like this:

http://localhost:8983/solr/collection1/select?q=name:walking&wt=json&indent=true

...

Lucene - default search lemmatization/stemming

java,lucene,search-engine,stemming,lemmatization

The sample referred to in your post uses the Lucene StandardAnalyzer, which does not do stemming. If you want stemming, you need to use another Analyzer implementation, e.g. SnowballAnalyzer ...

Mongodb stemming text

text,stemming

Snowball is a commonly used open source stemming approach, with implementations/ports for many (if not most) programming languages. If you only want stemming for your application, you should use the Snowball library directly. MongoDB 2.4+ uses Snowball internally for text stemming & indexing, but does not provide a separate API...

FastVectorHighlighter phrase highlighting not working with stemming

java,solr,lucene,stemming,fast-vector-highlighter

Turns out hl.fragSize wasn't set to a large enough value to include the entire highlighted sequence. The silly problems are often the worst.