determination of human language from text:: system structure [closed]

java,nlp,language-detection

That would work as a first approximation. The problem with fixed word lists for language detection, though, is that real texts (and especially short ones) may not provide enough hits in your list. A more reliable approach would be to collect other language features (like statistics of letter n-grams that...
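For illustration, a minimal character-bigram scorer in Python (the tiny profiles and the scoring function are simplified assumptions, not a production detector):

    from collections import Counter

    def char_ngrams(text, n=2):
        # collect overlapping character n-grams from lowercased text
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def score(text, profile):
        # overlap between the text's n-gram counts and a language profile
        grams = char_ngrams(text)
        return sum(count * profile.get(gram, 0) for gram, count in grams.items())

    # toy "profiles" built from tiny samples; real ones would use large corpora
    profiles = {
        "en": char_ngrams("the quick brown fox jumps over the lazy dog"),
        "de": char_ngrams("der schnelle braune fuchs springt ueber den faulen hund"),
    }

    text = "the weather is nice today"
    print(max(profiles, key=lambda lang: score(text, profiles[lang])))  # likely 'en'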

How to Break a sentence into a few words

python-2.7,parsing,nlp,nltk

Without using the Natural Language Toolkit (NLTK), you may use a simple Python command as follows:

    >>> line = "a sentence with a few words"
    >>> line.split()
    ['a', 'sentence', 'with', 'a', 'few', 'words']

as given in Split string into a list in Python ...

extracting n grams from huge text

python,performance,nlp,bigdata,text-processing

First of all, make sure that your file has lines; then you can read it line by line without worry (discussed here):

    with open('my100GBfile.txt') as corpus:
        for line in corpus:
            sequence = preprocess(line)
            extract_n_grams(sequence)

Let's assume that your corpus doesn't need any special treatment. I guess you can find...
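For illustration, one possible shape of those two helpers (preprocess and extract_n_grams are names from the question; this is only a sketch):

    def preprocess(line):
        # lowercase and split into tokens; adjust to your own cleaning needs
        return line.lower().split()

    def extract_n_grams(sequence, n=2):
        # yield overlapping n-grams from a token sequence, one line at a time,
        # so the whole 100 GB file never has to fit in memory
        for i in range(len(sequence) - n + 1):
            yield tuple(sequence[i:i + n])

    with open('my100GBfile.txt') as corpus:
        for line in corpus:
            for ngram in extract_n_grams(preprocess(line)):
                pass  # update counts, write to disk, etc.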

CoreNLP API for N-grams?

nlp,stanford-nlp,n-gram,pos-tagger

If you are coding in Java, check out getNgrams* functions in the StringUtils class in CoreNLP.

Format an entire text with pattern.en?

python,machine-learning,nlp

This should work for you:

    s = "Hello, how is it going ? I am tired actually, did not sleep enough... That is bad for work, definitely"
    s = parse(s)
    # Create a list of all the tags you don't want
    dont_want = ["UH", "PRP"]
    sentence = parse(s).split(" ")
    # Go through...

What is the default behavior of Stanford NLP's WordsToSentencesAnnotator when splitting a text into sentences?

nlp,stanford-nlp

It does split on these characters, but only when they appear as their own token and not at the end of an abbreviation such as in "etc.". So the issue here is not the sentence splitter but the tokenizer, which thinks that "N." is an abbreviation and therefore does not...

Convert nl string to vector or some numeric equivalent

javascript,string,nlp

I'm sure you've considered assigning each new word you encounter an integer. You'll have to keep track somewhere, but that's one option. You could also use whatever built-in hash method js has. If you don't mind a few hash collisions, and the size of the resulting integers doesn't matter, may...

TF - IDF vs only IDF

nlp,ranking,tf-idf

TF in TF-IDF means frequency of a term in a document. In other words, TF-IDF is a measure for both the term and the document. Here is a good illustration of what I mean. As far as I understand your case, you don't work with any particular document, instead you...
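As a concrete illustration, a toy TF-IDF computation using the common tf * log(N/df) weighting (implementations differ in the exact formula):

    import math
    from collections import Counter

    docs = [
        "the cat sat on the mat".split(),
        "the dog chased the cat".split(),
        "dogs and cats are pets".split(),
    ]

    def tf_idf(term, doc, docs):
        tf = Counter(doc)[term]                  # term frequency in this document
        df = sum(1 for d in docs if term in d)   # number of documents containing the term
        idf = math.log(len(docs) / df) if df else 0.0
        return tf * idf

    print(tf_idf("cat", docs[0], docs))  # weighted by how rare "cat" is in the collection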

Natural Language Search (user intent search)

nlp,search-engine,keyword,voice-recognition,naturallyspeaking

See the following example: "find me an android phone with 4 gb ram and at least 16 gb storage." First of all you need a list of words that you can directly extract from the input and insert in your search query. This is the easy part. "find me an android...

Annotator dependencies: UIMA Type Capabilities?

java,annotations,nlp,uima,dkpro-core

In UIMA the primary way in which one assures that the annotations you need are present for your annotator is to aggregate annotators together. So to answer your question, that is how you are going to achieve what you want, because what you want to do (have UIMA figure out...

lingpipe sentiment analysis tutorial demo error?

java,eclipse,nlp,lingpipe

The specified path file:///Users/dylan/Desktop/POLARITY_DIR/ should contain the unpacked data (see tutorial) in the directory txt_sentoken. You can see this in the output:

    Data Directory=file:/Users/dylan/Desktop/POLARITY_DIR/txt_sentoken

Also, the tutorial is not set up to use a URL, so the command should be:

    java -cp sentimentDemo.jar:../../../lingpipe-4.0.jar PolarityBasic /Users/dylan/Desktop/POLARITY_DIR...

Stanford NLP: Chinese Part of Speech labels?

python,nlp,stanford-nlp,pos-tagger,part-of-speech

We use the tag set of the (Penn/LDC/Brandeis/UC Boulder) Chinese Treebank. See here for details on the tag set: http://www.cis.upenn.edu/~chinese/ This was documented in the parser FAQ, but I'll add it to the tagger FAQ....

difference between Latent and Explicit Semantic Analysis

machine-learning,nlp

The difference between Latent Semantic Analysis and so-called Explicit Semantic Analysis lies in the corpus that is used and in the dimensions of the vectors that model word meaning. Latent Semantic Analysis starts from document-based word vectors, which capture the association between each word and the documents in which it...

How to iterate through the synset list generated from wordnet using python 3.4.2

python-3.x,nlp,wordnet,sentiment-analysis

You can get the name and the pos tag of each synset like this:

    from nltk.corpus import wordnet as wn

    synonyms = wn.synsets('good', 'a')
    for synset in synonyms:
        print(synset.name())
        print(synset.pos())

The name is the combination of word, pos and sense, such as 'full.s.06'. If you just want the word, you can...

Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags

python,nlp,nltk,corpus,tagged-corpus

Since you just want the POS of loose words, you don't need ngrams; you need a tagged corpus. Assuming your corpus is already tagged, you can do it like this:

    >>> import nltk
    >>> from nltk.corpus import brown
    >>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
    >>> wordcounts["set"].tabulate(10)
     VBN   VB   NN  VBD VBN-HL NN-HL
     159   88   86...

Python convert list of multiple words to single words

python,nlp,nltk

You can join using a space separator and then split again:

    In [22]: words = ['one','two','three four','five','six seven']
             ' '.join(words).split()
    Out[22]: ['one', 'two', 'three', 'four', 'five', 'six', 'seven']

...

Create Dictionary from Penn Treebank Corpus sample from NLTK?

python,dictionary,nlp,nltk,corpus

Quick solution:

    >>> from nltk.corpus import treebank
    >>> from nltk import ConditionalFreqDist as cfd
    >>> from itertools import chain
    >>> treebank_tagged_words = list(chain(*list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))))
    >>> wordcounts = cfd(treebank_tagged_words)
    >>> treebank_tagged_words[0]
    (u'Pierre', u'NNP')
    >>> wordcounts[u'Pierre']
    FreqDist({u'NNP': 1})
    >>> treebank_tagged_words[100]
    (u'asbestos', u'NN')...

Way to dump the relations from Freebase?

nlp,semantic-web,freebase,dbpedia,wikidata

Freebase was retired and Wikidata is the recommended alternative. You can use the Wikidata Query API to get entities with a specific property. For instance, the query http://wdq.wmflabs.org/api?q=CLAIM[26] retrieves the IDs of all items having the property spouse (P26). You can combine this with the Wikidata API, for instance to...

How to deserialize a CoNLL format dependency tree with ClearNLP?

java,nlp,deserialization,parse-tree,clearnlp

Found the answer, so sharing the wisdom on SO :-) ... The deserialization can be done using the TSVReader in the edu.emory.clir.clearnlp.reader package.

    public void readCoNLL(String inputFile) throws Exception {
        TSVReader reader = new TSVReader(0, 1, 2, 4, 5, 6, 7);
        reader.open(new FileInputStream(inputFile));
        DEPTree tree;
        while ((tree = reader.next()) !=...

How to remove a custom word pattern from a text using NLTK with Python

python,regex,nlp,nltk,tokenize

You'll probably need to detect lines containing a question, then extract the question and drop the question number. The regexp for detecting a question label is

    qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

    questions = [ re.sub(qnum_pattern, "", line) for line in text...

Abbreviation Reference for NLTK Parts of Speach

python,nlp,nltk

The link that you have already mentioned has two different tagsets. For tagset documentation, see nltk.help.upenn_tagset() and nltk.help.brown_tagset(). In this particular example, these tags are from the Penn Treebank tagset. You can also read about these tags by:

    nltk.help.upenn_tagset('DT')
    nltk.help.upenn_tagset('EX')

Navigate an NLTK tree (follow-up)

python,tree,nlp,nltk

I usually use the subtrees function in combination with a filter for this. Changing your tree slightly to show that it only selects one of the NP's now:

    >>> tree = ParentedTree.fromstring("(S (NP (NNP)) (VP (VBZ) (NP (NNS))))")
    >>> for st in tree.subtrees(filter = lambda x: x.label() == "NP" and...

Parsing multiple sentences with MaltParser using NLTK

java,python,parsing,nlp,nltk

As I see in your code samples, you don't call tree() in this line:

    >>> print(next(next(mp.parse_sents([sent,sent2]))))

while you do call tree() in all cases with parse_one(). Otherwise I don't see a reason why it could happen: the parse_one() method of ParserI isn't overridden in MaltParser and everything it does is simply...

Fast shell command to remove stop words in a text file

shell,nlp,text-processing

Here's a method using the command line and perl. Save the text below as removeSW.sh:

    #! /bin/bash
    MYREGEX=\\b\(`perl -pe 's/\n/|/g' $1`\)\\b
    perl -pe "s/$MYREGEX//g" $2

Then if you have saved your file above as stopwords.txt, and have a second file (e.g.) called testtext.txt that contains: This is a file with...

NLP: Arrange words with tags into proper English sentence?

nlp

You should look into language models. A bigram language model, for example, will give you the probability of observing a sentence on the basis of the two-word sequences in that sentence. On the basis of a corpus of texts, it will have learned that "how are" has a higher probability...
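For example, a crude bigram model estimated from the Brown corpus can be used to rank candidate orderings (a sketch only; a real system would add smoothing and work in log probabilities):

    from itertools import permutations
    from nltk import bigrams, ConditionalFreqDist
    from nltk.corpus import brown   # requires nltk.download('brown')

    cfd = ConditionalFreqDist(bigrams(w.lower() for w in brown.words()))

    def score(words):
        # product of relative bigram frequencies (0 if a pair was never seen)
        p = 1.0
        for w1, w2 in zip(words, words[1:]):
            total = cfd[w1].N()
            p *= cfd[w1][w2] / total if total else 0.0
        return p

    words = ["are", "how", "you"]
    print(max(permutations(words), key=score))  # ideally ('how', 'are', 'you')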

How Can I Access the Brown Corpus in Java (aka outside of NLTK)

java,nlp,nltk,corpus,tagged-corpus

Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/ All the files are zip files. The data format is described on the Brown Corpus Wikipedia page. I dunno what else to say. From there things should be obvious. EDIT: if you want the original source data, I think there's...

Stanford CoreNLP wrong coreference resolution

nlp,stanford-nlp

Like many components in AI, the Stanford coreference system is only correct to a certain accuracy. In the case of coreference this accuracy is actually relatively low (~60 on standard benchmarks in a 0-100 range). To illustrate the difficulty of the problem, consider the following apparently similar sentence with a...

Stanford Entity Recognizer (caseless) in Python Nltk

python,nlp,nltk

The second parameter of NERTagger is the path to the Stanford tagger jar file, not the path to the model. So, change it to stanford-ner.jar (and place it there, of course). Also it seems that you should choose english.conll.4class.caseless.distsim.crf.ser.gz (from stanford-corenlp-caseless-2015-04-20-models.jar) instead of english.conll.4class.distsim.crf.ser.gz. Thus try the following: english_nertagger =...

NLP- Sentiment Processing for Junk Data takes time

nlp,stanford-nlp,sentiment-analysis,pos-tagger

Yes, the standard PCFG parser (the one that is run by default without any other options specified) will choke on this sort of long nonsense data. You might have better luck using the shift-reduce constituency parser, which is substantially faster than the PCFG and nearly as accurate.

Python : How to optimize comparison between two large sets?

python,list,optimization,comparison,nlp

You can create a set of your stop_words, then for every word in your text see if it is in the set. Actually it looks like you are already using a set. Though I don't know why you are sorting it. ...
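A minimal sketch of the set-membership idea (the word list here is illustrative):

    stop_words = {"the", "a", "an", "in", "of", "and"}   # set: O(1) membership tests

    text = "the cat sat in the garden".split()
    content_words = [w for w in text if w not in stop_words]
    print(content_words)  # ['cat', 'sat', 'garden']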

dynamically populate hashmap with human language dictionary for text analysis

java,dictionary,hashmap,nlp

The task of language identification is well researched and there are a lot of good libraries. For Java, try TIKA, or Language Detection Library for Java (they report "99% over precision for 53 languages"), or TextCat, or LingPipe - I'd suggest starting with the first; it seems to have...

Choosing correct word for the given string

nlp,stanford-nlp

You cannot really check every word, because certain words have repeated letters in their spelling. So one way you could go is: check each letter in the word and restrict its number of consecutive appearances to two, then check the new spelling...

Lemmatizer supporting german language (for commercial and research purpose)

machine-learning,nlp,linguistics

LanguageTool can do that (disclaimer: I'm the maintainer of LanguageTool), it's available under LGPL and implemented in Java. You could use GermanTagger.tag(), the result can have more than one reading (as language is often ambiguous), and each reading's AnalyzedToken finally has a lemma.

Only ignore stop words for ngram_range=1

python,nlp,scikit-learn

You have at least 2 options:
(1) combine 2 kinds of features with FeatureUnion: one for ngram_range of (1,1) with stop words and one for ngram_range of (2,3) without stop words;
(2) (more efficient, but harder to implement and use) implement your own analyzer that will check for presence in stop word...
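A sketch of the first option with scikit-learn's FeatureUnion (the vectorizer settings are illustrative):

    from sklearn.pipeline import FeatureUnion
    from sklearn.feature_extraction.text import CountVectorizer

    # unigrams with stop-word removal, bigrams/trigrams without
    union = FeatureUnion([
        ("unigrams", CountVectorizer(ngram_range=(1, 1), stop_words="english")),
        ("bi_tri", CountVectorizer(ngram_range=(2, 3))),
    ])

    docs = ["the cat sat on the mat", "the dog chased the cat"]
    X = union.fit_transform(docs)
    print(X.shape)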

Using word2vec to calculate similarity between users

nlp,recommendation-engine,mahout-recommender,word2vec

What you need is a simple unsupervised (or semi-supervised) clustering algorithm. word2vec with its pre-trained vectors may not be very helpful because institutions, etc. are unlikely to be in it. Also, it seems that the number of "aspects" a user has is small, so you can simply have a clustering...

Performing semantic analysis in text

nlp,semantic-web

What you are trying to do is essentially Natural Language Understanding, a subfield of Natural Language Processing, which again is a subfield of Computational Linguistics, often thought of as the engineering arm. You could do semantic parsing or relation extraction. Either is fine for this task. I decided to read...

How to replace a word by its most representative mention using Stanford CoreNLP Coreferences module

java,nlp,stanford-nlp

The challenge is you need to make sure that the token isn't part of its representative mention. For example, the token "Judy" has "Judy 's" as its representative mention, so if you replace it in the phrase "Judy 's", you'll end up with the double "'s". You can check if...

How can the NamedEntityTag be used as EntityMention in RelationMention in the RelationExtractor?

nlp,stanford-nlp

You can match the text from the EntityMention with the NamedEntityText value from the NamedEntityTag.

stemming words in python

python,nlp,stemming

For each suffix in the given list you can check if the given word ends with any of the given suffixes; if yes, remove the suffix, else return the word.

    suffixes = ['ing']

    def stem(word):
        for suff in suffixes:
            if word.endswith(suff):
                return word[:-len(suff)]
        return word

    print(stem('having'))
    >>> hav...

nltk sentence tokenizer, consider new lines as sentence boundary

python,nlp,nltk,tokenize

Well, I had the same problem and what I have done was split the text in '\n'. Something like this:

    # in my case, when it had '\n', I called it a new paragraph,
    # like a collection of sentences
    paragraphs = [p for p in text.split('\n') if p]
    #...

Save and reuse TfidfVectorizer in scikit learn

python,nlp,scikit-learn,pickle,text-mining

Firstly, it's better to leave the import at the top of your code instead of within your class:

    from sklearn.feature_extraction.text import TfidfVectorizer

    class changeToMatrix(object):
        def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
            ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you've got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or maybe somewhere else, so we'll assume it...

One Hot Encoding for representing corpus sentences in python

python,machine-learning,nlp,scikit-learn,one-hot

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix. Example code for two simple...
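A rough sketch of that token-to-id mapping followed by OneHotEncoder (names are illustrative):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    docs = ["the cat sat", "the dog barked"]
    tokens = [t for doc in docs for t in doc.split()]

    # map every distinct token to a stable integer id
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
    ids = np.array([[vocab[t]] for t in tokens])   # one column of token ids

    encoder = OneHotEncoder()
    one_hot = encoder.fit_transform(ids)           # sparse matrix, one row per token
    print(one_hot.toarray())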

NLP Shift reduce parser is throwing null pointer Exception for Sentiment calculation

nlp,stanford-nlp,sentiment-analysis,shift-reduce

Is there a specific reason why you are using version 3.4.1 and not the latest version? If I run your code with the latest version, it works for me (after I change the path to the SR model to edu/stanford/nlp/models/srparser/englishSR.ser.gz but I assume you changed that path on purpose). Also...

How to properly navigate an NLTK parse tree?

python,tree,nlp,nltk

Based on what you want to do, this should work. It will give you the closest left NP node first, then the second closest, etc. So, if you had a tree of (S (NP1) (VP (NP2) (VBZ))), your np_trees list would have [ParentedTree(NP2), ParentedTree(NP1)].

    from nltk.tree import *
    np_trees =...

How to suppress unmatched words in Stanford NER classifiers?

nlp,stanford-nlp,named-entity-recognition

Hi I'll try to help out! So it sounds to me like you have a list of strings that should be called "CURRENCY", and you have a list of strings that should be called "COUNTRY", etc... And you want something to tag strings based off of your list. So when...

Counting words in list using a dictionary

python,nlp

You should map the misspelled words to the original:

    words = {'acheive':'achieve', 'achiev':'achieve', 'achieve':'achieve'}
    s = "achiev acheive achieve"

    from collections import Counter
    from string import punctuation

    cn = Counter()
    for word in s.split():
        word = word.strip(punctuation)
        if word in words:
            wrd = words[word]
            cn[wrd] += 1

    print(cn)
    Counter({'achieve': 3})

You...

NLP - Word Representations

machine-learning,nlp,artificial-intelligence

Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings. In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates...

Python NLTK pos_tag not returning the correct part-of-speech tag

python,machine-learning,nlp,nltk,pos-tagger

In short: NLTK is not perfect. In fact, no model is perfect.

In long: try using other taggers (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g. HunPos, Stanford POS, Senna.

Using the default MaxEnt POS tagger from NLTK, i.e. nltk.pos_tag:

    >>> from nltk import word_tokenize, pos_tag
    >>> text = "The quick brown fox jumps over...

Are word-vector orientations universal?

nlp,word2vec

As outlined in the comments, "orientation" is not a well-defined concept in this context. A traditional word vector space has one dimension for each term. In order for word vectors to be compatible, they will need to have the same term order. This is typically not the case between different...

How apache UIMA is different from Apache Opennlp

nlp,opennlp,uima

As I understand the question, you are asking for the differences between the feature sets of Apache UIMA and Apache OpenNLP. Their feature sets barely have anything in common as these two projects have very different aims. Apache UIMA is an open source implementation of the UIMA specification. The latter...

Amazon Machine Learning for sentiment analysis

amazon-web-services,machine-learning,nlp,sentiment-analysis

You can build a good machine learning model for sentiment analysis using Amazon ML. Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media Since Amazon ML supports supervised learning as well as text as an input attribute, you need to get a sample of data...

Separately tokenizing and pos-tagging with CoreNLP

java,nlp,stanford-nlp

It seems to me that you would be better off separating the tokenization phase from your other downstream tasks (so I'm basically answering Question 2). You have two options: Tokenize using the Stanford tokenizer (example from Stanford CoreNLP usage page). The annotators options should only take 'tokenizer' in your case....

How to extract brand from product name

python,machine-learning,nlp

The information you're trying to get isn't actually there. If you take two strings, both of which may have any number of spaces, and join them together with a space, it's no longer possible to tell unambiguously which space was joining the two strings, and which spaces were part of...

StanfordNLP lemmatization cannot handle -ing words

java,nlp,stanford-nlp,stemming,lemmatization

Lemmatization crucially depends on the part of speech of the token. Only tokens with the same part of speech are mapped to the same lemma. In the sentence "This is confusing", confusing is analyzed as an adjective, and therefore it is lemmatized to confusing. In the sentence "I was confusing...

How to un-stem a word in Python?

python,nlp,nltk

I suspect what you really mean by stem is "tense". As in, you want the different tenses of each word to count towards the "base form" of the verb. Check out the pattern package:

    pip install pattern

Then use the en.lemma function to return a verb's base form. import...
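For instance (assuming the pattern package is installed; the classic pattern releases target Python 2):

    from pattern.en import lemma

    for word in ["studies", "studying", "cries", "cry"]:
        # lemma() maps each form back to the verb's base form
        print(word + " -> " + lemma(word))   # e.g. studies -> study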

The ± 2 window in Word similarity of NLP

vector,nlp,distribution

A ± 2 window means 2 words to the left and 2 words to the right of the target word. For target word "silence", the window would be ["gavel", "to", "the", "court"], and for "hammer", it would be ["when", "the", "struck", "it"].
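A small sketch of collecting ±2 windows from a token list (the sentence here is just an example, not the one from the question):

    def window(tokens, target, size=2):
        # words up to `size` positions to the left and right of each occurrence of `target`
        return [tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]
                for i, tok in enumerate(tokens) if tok == target]

    tokens = "the quick brown fox jumps over the lazy dog".split()
    print(window(tokens, "jumps"))  # [['brown', 'fox', 'over', 'the']]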

Documentation of Moses (statistical machine translation) mose.ini file format?

machine-learning,nlp,machine-translation,moses

The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications. Here are some excerpts I found on the Web about moses.ini. In the Moses Core,...

Stanford coreNLP : can a word in a sentence be part of multiple Coreference chains

nlp,stanford-nlp

A word can be part of multiple coreference mentions. Consider for example the mention "the new acquisition by Microsoft". In this case, there are two candidates for mentions: the new acquisition by Microsoft and Microsoft. From this example it also follows that a word can be part of multiple coreference...

Stanford NLP - Using Parsed or Tagged text to generate Full XML

parsing,nlp,stanford-nlp,pos-tagging

Yes, this is possible, but a bit tricky and there is no out of the box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and in case you also have trees the parse...

How to Identify mentions in a text?

nlp,stanford-nlp

I think you can get what you want from the standard dcoref annotator. Look at the annotation set by this annotator, CorefChainAnnotation. This is a map from document entities to "coref chains." Each CorefChain can provide you with a list of mentions for the relevant entity in textual order....

Tabulating characters with diacritics in R

r,unicode,nlp,linguistics

Try

    library(stringi)
    table(stri_split_boundaries(word, type='character'))
    #a  n n̥
    #2  1  1

Or

    table(strsplit(word, '(?<=\\P{Ll}|\\w)(?=\\w)', perl=TRUE))
    #a  n n̥
    #2  1  1

...

Implementing Naive Bayes text categorization but I keep getting zeros

python,algorithm,nlp,text-classification,naivebayes

As Ed Cottrell commented, you need to consider what happens if you encounter a word that is not in the documents in a category. You can avoid multiplying by 0 by using Laplace smoothing. If you see a word in k out of n documents in a category, you assign...
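A sketch of add-one (Laplace) smoothing for per-class word probabilities (variable names and the vocabulary size are illustrative):

    import math
    from collections import Counter

    def train(docs_in_class):
        # count word occurrences over all documents of one class
        counts = Counter(w for doc in docs_in_class for w in doc)
        total = sum(counts.values())
        return counts, total

    def log_prob(word, counts, total, vocab_size):
        # add-one smoothing: unseen words get a small but non-zero probability
        return math.log((counts[word] + 1) / (total + vocab_size))

    pos_docs = [["great", "movie"], ["great", "fun"]]
    counts, total = train(pos_docs)
    vocab_size = 1000  # size of the full vocabulary, assumed known
    print(log_prob("great", counts, total, vocab_size))
    print(log_prob("terrible", counts, total, vocab_size))  # never multiplied by zero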

NLP - Error while Tokenization and Tagging etc [duplicate]

java,nlp,stanford-nlp

You're trying to use Stanford NLP tools version 3.5 or later using a version of Java 7 or earlier. Either upgrade to Java 8 or downgrade to Stanford tools version 3.4.1.

POS of WSJ in CONLL format from penn tree bank

nlp

Check the perl script available from the task page: http://ilk.uvt.nl/team/sabine/homepage/software.html

Annotator for Relationship Extraction

regex,nlp,nltk,stanford-nlp,gate

There is no PR in GATE that will pair arguments and create instances for you. You must therefore create instances that are relevant to your problem. You can: write a custom PR, or write some JAPE with a Java RHS. You can probably split your corpus into a training and...

NLTK getting dependencies from raw text

python-2.7,nlp,nltk

You are instantiating MaltParser with an unsuitable argument. Running help(MaltParser) gives the following information:

    Help on class MaltParser in module nltk.parse.malt:

    class MaltParser(nltk.parse.api.ParserI)
     |  Method resolution order:
     |      MaltParser
     |      nltk.parse.api.ParserI
     |      __builtin__.object
     |
     |  Methods defined here:
     |
     |  __init__(self, tagger=None, mco=None, working_dir=None, additional_java_args=None)
     |      An interface for parsing...

Stanford Parser - Factored model and PCFG

parsing,nlp,stanford-nlp,sentiment-analysis,text-analysis

This FAQ answer explains the difference in a long paragraph. Relevant parts are quoted below: Can you explain the different parsers? This answer is specific to English. It mostly applies to other languages although some components are missing in some languages. The file englishPCFG.ser.gz comprises just an unlexicalized PCFG grammar....

What exactly is the difference between AnalysisEngine and CAS Consumer?

nlp,uima

There is no difference between them in the current version. Historically, a CASConsumer would typically not modify the CAS, but only use the data existing in the CAS (previously added by an Analysis Engine) to aggregate it/prepare it for use in other systems, e.g., ingestion in databases. In the current...

Handling count of characters with diacritics in R

r,unicode,character-encoding,nlp,linguistics

Here is my solution. The idea is that phonetic alphabets can have a Unicode representation and then: use the Unicode package; it provides the function Unicode_alphabetic_tokenizer that: Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have...

How to not split English into separate letters in the Stanford Chinese Parser

python,nlp,stanford-nlp,segment,chinese-locale

I don't know about tokenization in mixed language texts, so I propose the following hack: go through the text until you find an English word; all text before this word can be tokenized by the Chinese tokenizer; the English word can be appended as another token; repeat. Below is a code sample....

Text analysis : What after term-document matrix? [closed]

r,machine-learning,nlp,svm,text-mining

This isn't really a programming question, but anyway: If your goal is prediction, as opposed to text classification, usual methods are backoff models (Katz Backoff) and interpolation/smoothing, e.g. Kneser-Ney smoothing. More complicated models like Random Forests are AFAIK not absolutely necessary and may pose problems if you need to make...

How to split a sentence in Python?

python-3.x,nlp

If your text is already split into sentences, just use tokens = nltk.word_tokenize(sentence) (see tokenization from NLTK). If you need to split by sentences first, take a look at this part (there is code sample)....
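A minimal sketch combining both steps (assumes the NLTK 'punkt' model has been downloaded):

    import nltk

    text = "This is one sentence. Here is another one."
    for sentence in nltk.sent_tokenize(text):     # split into sentences first
        print(nltk.word_tokenize(sentence))       # then into word tokens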

Getting the maximum common words in R

r,nlp

I think this gets what you're looking for:

    library(dplyr)

    # make up some data
    set.seed(1492)
    rbind_all(lapply(1:15, function(i) {
      x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
      colnames(x) <- c("ID", sprintf("A%d", 1:10))
      x
    })) -> dat

    print(dat)
    ## Source: local data frame [15 x 11]
    ##
    ##    ID A1 A2 A3 A4 A5...

PHP: Translate in natural language a weekly calendar availability?

javascript,php,algorithm,nlp

7 possible combinations (from most specific to least):

    Morning, Afternoon, Night
    Morning, Afternoon
    Morning, Night
    Afternoon, Night
    Morning
    Afternoon
    Night

You can identify the combinations similar to the way unix permissions are identified:

    $111 = "Morning - Night";
    $110 = "Morning - Afternoon";
    $101 = "Morning - Night";
    $011 =...

Transform column of strings with word cluster values

r,nlp,cluster-computing

You could try something like this:

    library(tidyr)
    library(dplyr)
    library(stringi)

    df1 <- unnest(stri_split_fixed(original_text_df$Text, ' '), group) %>%
      group_by(x) %>%
      mutate(cluster = cluster_df$Cluster[cluster_df$Word %in% x])

Which gives:

    #Source: local data frame [8 x 3]
    #Groups: x
    #
    #  group    x cluster
    #1    X1 this       2
    #2    X1   is       2
    #3    X1 some...

What is the most efficient way of storing language models in NLP applications?

nlp,n-gram,language-model

The most common data structures in language models are tries and hash tables. You can take a look at Kenneth Heafield's paper on his own language model toolkit KenLM for more detailed information about the data structures used by his own software and related packages.
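As a toy illustration of the trie idea, nested dictionaries keyed by successive words (real toolkits such as KenLM use far more compact structures):

    def add_ngram(trie, ngram, count=1):
        # walk/create one node per word; store the count at the final node
        node = trie
        for word in ngram:
            node = node.setdefault(word, {})
        node["#count"] = node.get("#count", 0) + count

    def get_count(trie, ngram):
        node = trie
        for word in ngram:
            if word not in node:
                return 0
            node = node[word]
        return node.get("#count", 0)

    trie = {}
    add_ngram(trie, ("the", "cat", "sat"))
    add_ngram(trie, ("the", "cat", "ran"))
    print(get_count(trie, ("the", "cat", "sat")))  # 1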

Identify prepositons and individual POS

nlp,stanford-nlp

I have had something of a breakthrough in understanding whether the word is actually a preposition or a subordinating conjunction. I have parsed the following sentence: She left early because Mike arrived with his new girlfriend. (here because is a subordinating conjunction) After POS tagging:

    She_PRP left_VBD early_RB because_IN Mike_NNP arrived_VBD with_IN his_PRP$ new_JJ...

term frequency over time: how to plot +200 graphs in one plot with Python/pandas/matplotlib?

python,matplotlib,plot,nlp

Your small data set script is largely correct, but with some minor errors. You are missing the if i=='date': continue line. (The source of your 'not comparable' error). In your post, your else line is mis-indented. Possibly (only possibly) you need a call to plt.hold(True) to prevent the creation of...

Mecab output - list of name types

nlp,translation,pos-tagger,mecab

Looks like this has all the info: http://www.unixuser.org/~euske/doc/postag/, depending on the output selected for mecab.

Problems obtaining most informative features with scikit learn?

python,pandas,machine-learning,nlp,scikit-learn

To solve this specifically for linear SVM, we first have to understand the formulation of the SVM in sklearn and the differences that it has to MultinomialNB. The reason why the most_informative_feature_for_class works for MultinomialNB is because the output of the coef_ is essentially the log probability of features given...

Where can I find a corpus of search engine queries?

nlp,search-engine,google-search,bing

There are a couple of datasets like this:

Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l
Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data (part of a Kaggle problem; you can sign up and download)

There are also AOL Query Logs and MSN Query Logs which had been publicised as part of shared tasks in the past 10 years. I'm not...

Get noun from verb Wordnet

python,nlp,wordnet

You could try something like this:

    from nltk.corpus import wordnet as wn

    def nounify(verb_word):
        set_of_related_nouns = set()
        for lemma in wn.lemmas(wn.morphy(verb_word, wn.VERB), pos="v"):
            for related_form in lemma.derivationally_related_forms():
                for synset in wn.synsets(related_form.name(), pos=wn.NOUN):
                    if wn.synset('person.n.01') in synset.closure(lambda s: s.hypernyms()):
                        set_of_related_nouns.add(synset)
        return set_of_related_nouns

This method looks up all derivationally related nouns to a verb, and checks if they have...

Chinese sentence segmenter with Stanford coreNLP

java,nlp,tokenize,stanford-nlp

Using Stanford Segmenter instead:

    $ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-04-20.zip
    $ unzip stanford-segmenter-2015-04-20.zip
    $ echo "应有尽有的丰富选择定将为您的旅程增添无数的赏心乐事" > input.txt
    $ bash stanford-segmenter-2015-04-20/segment.sh ctb input.txt UTF-8 0 > output.txt
    $ cat output.txt
    应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事

Other than Stanford Segmenter, there are many other...

Coreference resolution using Stanford CoreNLP

java,nlp,stanford-nlp

There is a Annotation constructor with a List<CoreMap> sentences argument which sets up the document if you have a list of already tokenized sentences. For each sentence you want to create a CoreMap object as following. (Note that I also added a sentence and token index to each sentence and...

How to use serialized CRFClassifier with StanfordCoreNLP prop 'ner'

java,nlp,stanford-nlp

The property name is ner.model, not ner.models, so your code is still trying to load the default models. Let me know if this is documented incorrectly somewhere....

How to define a CAS in database as external resource for an annotator in uimaFIT?

nlp,data-mining,information-retrieval,uima

You can define the dependency with @TypeCapability like this:

    @TypeCapability(inputs = { "com.myproject.types.MyType", ... }, outputs = { ... })
    public class MyAnnotator extends JCasAnnotator_ImplBase {
        ....
    }

Note that it defines a contract at the annotation level, not the engine level (meaning that any Engine could create com.myproject.types.MyType). I...

NLTK fcfg sem value is awkward

python,nlp,nltk,context-free-grammar

This is the normal form for nouns:

    N[NUM=sg, SEM=<\x.race(x)>] -> 'race'

So, for example, you might have:

    Det[NUM=sg, SEM=<\P.\Q.exists x.(P(x) & Q(x))>] -> 'a'
    NP[NUM=?num, SEM=<?det(?n)>] -> Det[NUM=?num, SEM=?det] N[NUM=?num, SEM=?n]

So that "a race" is:

    \P.\Q.exists x.(P(x) & Q(x))(\x.race(x))
    = \Q.exists x.(\y.race(y)(x) & Q(x)))
    \Q.exists x.(race(x) & Q(x)))

And...

software to extract word functions like subject, predicate, object etc

nlp,stanford-nlp

According to http://nlp.stanford.edu/software/lex-parser.shtml, Stanford NLP does have a parser which can identify the subject and predicate of a sentence. You can try it out online http://nlp.stanford.edu:8080/parser/index.jsp. You can use the typed dependencies to identify the subject, predicate, and object. From the example page, the sentence My dog also likes eating...

How to interpret scikit's learn confusion matrix and classification report?

machine-learning,nlp,scikit-learn,svm,confusion-matrix

The classification report must be straightforward - a report of P/R/F-Measure for each element in your test data. In multiclass problems, it is not a good idea to read Precision/Recall and F-Measure over the whole data, as any imbalance would make you feel you've reached better results. That's where such reports help....
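For reference, a minimal sketch of producing both outputs with scikit-learn (the labels are made up):

    from sklearn.metrics import classification_report, confusion_matrix

    y_true = [0, 1, 2, 2, 1, 0, 2]
    y_pred = [0, 1, 1, 2, 1, 0, 2]

    print(confusion_matrix(y_true, y_pred))       # rows = true classes, columns = predicted
    print(classification_report(y_true, y_pred))  # precision/recall/F1 per class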

Add Features to An Sklearn Classifier

python,machine-learning,nlp,scikit-learn

You can use feature union http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces There is a nice example in the documentation http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#example-hetero-feature-union-py which I think exactly fits your requirements. See TextStats transformer. Regards,...

How to extract derivation rules from a bracketed parse tree?

java,parsing,recursion,nlp,pseudocode

I wrote this for Python. I believe you can read it as pseudocode. I will edit the post for Java later. I added the Java implementation later.

    import re

    # grammar repository
    grammar_repo = []
    s = "( S ( NP-SBJ ( PRP I ) ) (...