
user defined feature in CRF++

Tag: nlp,crf,crf++

I tried to add more features to my CRF++ template.

I followed the advice in "How can I tell CRF++ classifier that a word x is capitalized or understanding punctuations?".

training sample

The  DT  0  1   0   1   B-MISC
Oxford  NNP 0   1   0   1   I-MISC
Companion   NNP 0   1   0   1   I-MISC
to  TO  0   0   0   0   I-MISC
Philosophy  NNP 0   1   0   1   I-MISC

feature template

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,0]/%x[-1,0]/%x[0,0]

#shape feature
U08:%x[-2,2]
U09:%x[-1,2]
U10:%x[0,2]
U11:%x[1,2]
U12:%x[2,2]

B

The training phase works fine, but I get no output from crf_test:

/data/wikipedia/en$ crf_test -m validation_model test.data
/data/wikipedia/en$

Everything works fine if I ignore the shape features above. Where did I go wrong?

Best How To :

I figured this out. It was a problem with my test data. I thought that every feature would be taken from the trained model, so my test data had only two columns (word and tag). It turns out that the test file must have exactly the same format as the training data!
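
For illustration, a minimal test.data in the same seven-column layout as the training sample above (the words and feature values here are hypothetical):

The  DT  0  1  0  1  O
Internet  NNP  0  1  0  1  O
Encyclopedia  NNP  0  1  0  1  O
of  IN  0  0  0  0  O
Philosophy  NNP  0  1  0  1  O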

extracting n grams from huge text

python,performance,nlp,bigdata,text-processing

First of all, make sure that your file actually has line breaks; then you can read it line by line without worry (discussed here):

with open('my100GBfile.txt') as corpus:
    for line in corpus:
        sequence = preprocess(line)
        extract_n_grams(sequence)

Let's assume that your corpus doesn't need any special treatment. I guess you can find...
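
As a rough, self-contained sketch of that streaming approach (the file name and the whitespace preprocessing are placeholders, not the asker's real pipeline):

def extract_n_grams(tokens, n=2):
    # yield each successive n-gram from a token sequence
    return zip(*(tokens[i:] for i in range(n)))

with open('my100GBfile.txt') as corpus:
    for line in corpus:
        sequence = line.lower().split()     # stand-in for a real preprocess(line)
        for gram in extract_n_grams(sequence, n=2):
            pass                            # count or store the n-gram here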

How to use serialized CRFClassifier with StanfordCoreNLP prop 'ner'

java,nlp,stanford-nlp

The property name is ner.model, not ner.models, so your code is still trying to load the default models. Let me know if this is documented incorrectly somewhere....

Chinese sentence segmenter with Stanford coreNLP

java,nlp,tokenize,stanford-nlp

Use the Stanford Segmenter instead:

$ wget http://nlp.stanford.edu/software/stanford-segmenter-2015-04-20.zip
$ unzip stanford-segmenter-2015-04-20.zip
$ echo "应有尽有的丰富选择定将为您的旅程增添无数的赏心乐事" > input.txt
$ bash stanford-segmenter-2015-04-20/segment.sh ctb input.txt UTF-8 0 > output.txt
$ cat output.txt
应有尽有 的 丰富 选择 定 将 为 您 的 旅程 增添 无数 的 赏心 乐事

Other than the Stanford Segmenter, there are many other...

Implementing Naive Bayes text categorization but I keep getting zeros

python,algorithm,nlp,text-classification,naivebayes

As Ed Cottrell commented, you need to consider what happens if you encounter a word that is not in the documents in a category. You can avoid multiplying by 0 by using Laplace smoothing. If you see a word in k out of n documents in a category, you assign...
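
A minimal sketch of add-one (Laplace) smoothing along those lines; the counts and the two-outcome denominator are assumptions for illustration only:

import math

def laplace_prob(k, n, alpha=1, outcomes=2):
    # word seen in k of n documents in a class; an unseen word (k = 0)
    # still gets a small nonzero probability instead of zero
    return (k + alpha) / float(n + alpha * outcomes)

# sum log-probabilities rather than multiplying raw probabilities,
# which also avoids floating-point underflow on long documents
log_score = sum(math.log(laplace_prob(k, 10)) for k in (3, 0, 7))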

Performing semantic analysis in text

nlp,semantic-web

What you are trying to do is essentially Natural Language Understanding, a subfield of Natural Language Processing, which in turn is a subfield of Computational Linguistics (often thought of as its engineering arm). You could do semantic parsing or relation extraction. Either is fine for this task. I decided to read...

Abbreviation Reference for NLTK Parts of Speech

python,nlp,nltk

The link that you have already mentioned has two different tagsets. For tagset documentation, see nltk.help.upenn_tagset() and nltk.help.brown_tagset(). In this particular example, these tags are from Penn Treebank tagset. You can also read about these tags by: nltk.help.upenn_tagset('DT') nltk.help.upenn_tagset('EX') ...

Tabulating characters with diacritics in R

r,unicode,nlp,linguistics

Try

library(stringi)
table(stri_split_boundaries(word, type='character'))
# a  n  n̥
# 2  1  1

Or

table(strsplit(word, '(?<=\\P{Ll}|\\w)(?=\\w)', perl=TRUE))
# a  n  n̥
# 2  1  1

...

How to extract derivation rules from a bracketed parse tree?

java,parsing,recursion,nlp,pseudocode

I wrote this in Python; I believe you can read it as pseudocode. I added the Java implementation later.

import re

# grammar repository
grammar_repo = []
s = "( S ( NP-SBJ ( PRP I ) ) (...

NLTK getting dependencies from raw text

python-2.7,nlp,nltk

You are instantiating MaltParser with an unsuitable argument. Running help(MaltParser) gives the following information:

Help on class MaltParser in module nltk.parse.malt:

class MaltParser(nltk.parse.api.ParserI)
 |  Method resolution order:
 |      MaltParser
 |      nltk.parse.api.ParserI
 |      __builtin__.object
 |
 |  Methods defined here:
 |
 |  __init__(self, tagger=None, mco=None, working_dir=None, additional_java_args=None)
 |      An interface for parsing...

Separately tokenizing and pos-tagging with CoreNLP

java,nlp,stanford-nlp

It seems to me that you would be better off separating the tokenization phase from your other downstream tasks (so I'm basically answering Question 2). You have two options: Tokenize using the Stanford tokenizer (example from Stanford CoreNLP usage page). The annotators options should only take 'tokenizer' in your case....

Coreference resolution using Stanford CoreNLP

java,nlp,stanford-nlp

There is an Annotation constructor with a List<CoreMap> sentences argument, which sets up the document if you have a list of already tokenized sentences. For each sentence you want to create a CoreMap object as follows. (Note that I also added a sentence and token index to each sentence and...

How to remove a custom word pattern from a text using NLTK with Python

python,regex,nlp,nltk,tokenize

You'll probably need to detect lines containing a question, then extract the question and drop the question number. The regexp for detecting a question label is

qnum_pattern = r"^\s*\(Q\d+\)\.\s+"

You can use it to pull out the questions like this:

questions = [re.sub(qnum_pattern, "", line) for line in text...

Stanford coreNLP : can a word in a sentence be part of multiple Coreference chains

nlp,stanford-nlp

A word can be part of multiple coreference mentions. Consider for example the mention "the new acquisition by Microsoft". In this case, there are two candidates for mentions: the new acquisition by Microsoft and Microsoft. From this example it also follows that a word can be part of multiple coreference...

NLP Shift reduce parser is throwing null pointer Exception for Sentiment calculation

nlp,stanford-nlp,sentiment-analysis,shift-reduce

Is there a specific reason why you are using version 3.4.1 and not the latest version? If I run your code with the latest version, it works for me (after I change the path to the SR model to edu/stanford/nlp/models/srparser/englishSR.ser.gz but I assume you changed that path on purpose). Also...

software to extract word functions like subject, predicate, object etc

nlp,stanford-nlp

According to http://nlp.stanford.edu/software/lex-parser.shtml, Stanford NLP does have a parser which can identify the subject and predicate of a sentence. You can try it out online http://nlp.stanford.edu:8080/parser/index.jsp. You can use the typed dependencies to identify the subject, predicate, and object. From the example page, the sentence My dog also likes eating...

The ± 2 window in Word similarity of NLP

vector,nlp,distribution

A ± 2 window means 2 words to the left and 2 words to the right of the target word. For target word "silence", the window would be ["gavel", "to", "the", "court"], and for "hammer", it would be ["when", "the", "struck", "it"].
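
A small sketch of collecting a ± 2 window around a target word (the sentence below is made up for illustration, not the one from the original question):

def window(tokens, i, size=2):
    # up to `size` tokens to the left and to the right of position i
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

tokens = "the judge called for silence in the court room".split()
print(window(tokens, tokens.index("silence")))  # ['called', 'for', 'in', 'the']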

Identify prepositons and individual POS

nlp,stanford-nlp

I have had a breakthrough in telling whether a word is actually a preposition or a subordinating conjunction. I parsed the following sentence: She left early because Mike arrived with his new girlfriend. (Here because is a subordinating conjunction.) After POS tagging: She_PRP left_VBD early_RB because_IN Mike_NNP arrived_VBD with_IN his_PRP$ new_JJ...

NLP- Sentiment Processing for Junk Data takes time

nlp,stanford-nlp,sentiment-analysis,pos-tagger

Yes, the standard PCFG parser (the one that is run by default without any other options specified) will choke on this sort of long nonsense data. You might have better luck using the shift-reduce constituency parser, which is substantially faster than the PCFG and nearly as accurate.

How to split a sentence in Python?

python-3.x,nlp

If your text is already split into sentences, just use tokens = nltk.word_tokenize(sentence) (see tokenization from NLTK). If you need to split by sentences first, take a look at this part (there is code sample)....
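
Putting both steps together, assuming NLTK and its punkt data are installed (a sketch, not the only way to do it):

import nltk

text = "This is the first sentence. Here is another one."
for sentence in nltk.sent_tokenize(text):   # split the text into sentences
    tokens = nltk.word_tokenize(sentence)   # then split each sentence into words
    print(tokens)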

Annotator dependencies: UIMA Type Capabilities?

java,annotations,nlp,uima,dkpro-core

In UIMA, the primary way you ensure that the annotations your annotator needs are present is to aggregate annotators together. So to answer your question, that is how you are going to achieve what you want, because what you want to do (have UIMA figure out...

Amazon Machine Learning for sentiment analysis

amazon-web-services,machine-learning,nlp,sentiment-analysis

You can build a good machine learning model for sentiment analysis using Amazon ML. Here is a link to a github project that is doing just that: https://github.com/awslabs/machine-learning-samples/tree/master/social-media Since the Amazon ML supports supervised learning as well as text as input attribute, you need to get a sample of data...

One Hot Encoding for representing corpus sentences in python

python,machine-learning,nlp,scikit-learn,one-hot

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix. Example code for two simple...
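
A rough sketch of that token-to-id-then-encode idea with scikit-learn; the two tiny documents are made up, and your real tokenization will differ:

from sklearn.preprocessing import OneHotEncoder

docs = [["the", "cat", "sat"], ["the", "dog", "ran"]]

# map every distinct token string to a stable integer id
vocab = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# one row per token; OneHotEncoder expects a 2-D array of ids
ids = [[vocab[t]] for d in docs for t in d]

enc = OneHotEncoder()
onehot = enc.fit_transform(ids)   # sparse matrix by default
print(onehot.shape)               # (6, 5): six tokens, five distinct words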

Where can I find a corpus of search engine queries?

nlp,search-engine,google-search,bing

There are a couple of datasets like this: Yahoo Webscope: http://webscope.sandbox.yahoo.com/catalog.php?datatype=l Yandex datasets: https://www.kaggle.com/c/yandex-personalized-web-search-challenge/data (part of a Kaggle problem; you can sign up and download). There are also the AOL Query Logs and MSN Query Logs, which were publicised as part of shared tasks in the past 10 years. I'm not...

How to not split English into separate letters in the Stanford Chinese Parser

python,nlp,stanford-nlp,segment,chinese-locale

I don't know about tokenization in mixed-language texts, so I propose the following hack: go through the text until you find an English word; all text before this word can be tokenized by the Chinese tokenizer; the English word can be appended as another token; repeat. Below is a code sample....

Stanford Parser - Factored model and PCFG

parsing,nlp,stanford-nlp,sentiment-analysis,text-analysis

This FAQ answer explains the difference in a long paragraph. Relevant parts are quoted below: Can you explain the different parsers? This answer is specific to English. It mostly applies to other languages although some components are missing in some languages. The file englishPCFG.ser.gz comprises just an unlexicalized PCFG grammar....

term frequency over time: how to plot +200 graphs in one plot with Python/pandas/matplotlib?

python,matplotlib,plot,nlp

Your small data set script is largely correct, but with some minor errors. You are missing the if i=='date': continue line. (The source of your 'not comparable' error). In your post, your else line is mis-indented. Possibly (only possibly) you need a call to plt.hold(True) to prevent the creation of...

Fast shell command to remove stop words in a text file

shell,nlp,text-processing

Here's a method using the command line and perl. Save the text below as removeSW.sh:

#! /bin/bash
MYREGEX=\\b\(`perl -pe 's/\n/|/g' $1`\)\\b
perl -pe "s/$MYREGEX//g" $2

Then if you have saved your file above as stopwords.txt, and have a second file (e.g.) called testtext.txt that contains: This is a file with...

Save and reuse TfidfVectorizer in scikit learn

python,nlp,scikit-learn,pickle,text-mining

Firstly, it's better to leave the import at the top of your code instead of within your class:

from sklearn.feature_extraction.text import TfidfVectorizer

class changeToMatrix(object):
    def __init__(self, ngram_range=(1,1), tokenizer=StemTokenizer()):
        ...

Next, StemTokenizer doesn't seem to be a canonical class. Possibly you've got it from http://sahandsaba.com/visualizing-philosophers-and-scientists-by-the-words-they-used-with-d3js-and-python.html or maybe somewhere else, so we'll assume it...
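
A minimal save-and-reload sketch with pickle (the file name and toy documents are placeholders; if you use a custom tokenizer class, it must be importable when you unpickle):

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(ngram_range=(1, 1))
X_train = vec.fit_transform(["first toy document", "second toy document"])

# persist the fitted vectorizer so its vocabulary and idf weights are reused
with open("tfidf.pkl", "wb") as f:
    pickle.dump(vec, f)

with open("tfidf.pkl", "rb") as f:
    vec_loaded = pickle.load(f)

X_new = vec_loaded.transform(["an unseen document"])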

Counting words in list using a dictionary

python,nlp

You should map the misspelled words to the original:

from collections import Counter
from string import punctuation

words = {'acheive': 'achieve', 'achiev': 'achieve', 'achieve': 'achieve'}
s = "achiev acheive achieve"

cn = Counter()
for word in s.split():
    word = word.strip(punctuation)
    if word in words:
        wrd = words[word]
        cn[wrd] += 1

print(cn)
# Counter({'achieve': 3})

You...

What is the default behavior of Stanford NLP's WordsToSentencesAnnotator when splitting a text into sentences?

nlp,stanford-nlp

It does split on these characters, but only when they appear as their own token and not at the end of an abbreviation such as "etc.". So the issue here is not the sentence splitter but the tokenizer, which thinks that "N." is an abbreviation and therefore does not...

Parsing multiple sentences with MaltParser using NLTK

java,python,parsing,nlp,nltk

As I see in your code samples, you don't call tree() in this line >>> print(next(next(mp.parse_sents([sent,sent2])))) while you do call tree() in all cases with parse_one(). Otherwise I don't see the reason why it could happen: parse_one() method of ParserI isn't overridden in MaltParser and everything it does is simply...

Getting the maximum common words in R

r,nlp

I think this gets what you're looking for:

library(dplyr)

# make up some data
set.seed(1492)
rbind_all(lapply(1:15, function(i) {
  x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
  colnames(x) <- c("ID", sprintf("A%d", 1:10))
  x
})) -> dat

print(dat)
## Source: local data frame [15 x 11]
##
##    ID A1 A2 A3 A4 A5...

NLP: Arrange words with tags into proper English sentence?

nlp

You should look into language models. A bigram language model, for example, will give you the probability of observing a sentence on the basis of the two-word sequences in that sentence. On the basis of a corpus of texts, it will have learned that "how are" has a higher probability...
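
A toy maximum-likelihood bigram model to make the idea concrete (the three-sentence corpus is invented, and there is no smoothing here):

from collections import Counter

corpus = ["how are you today", "how are things", "you are how"]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent.split()
    for w1, w2 in zip(toks, toks[1:]):
        bigrams[(w1, w2)] += 1
        unigrams[w1] += 1

def prob(w1, w2):
    # maximum-likelihood estimate of P(w2 | w1)
    return bigrams[(w1, w2)] / float(unigrams[w1]) if unigrams[w1] else 0.0

print(prob("how", "are"))   # 1.0  -- "how are" is frequent in this corpus
print(prob("are", "how"))   # 0.33 -- the reversed order is rarer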

How to deserialize a CoNLL format dependency tree with ClearNLP?

java,nlp,deserialization,parse-tree,clearnlp

Found the answer, so sharing the wisdom on SO :-) ... The deserialization can be done using the TSVReader in the edu.emory.clir.clearnlp.reader package.

public void readCoNLL(String inputFile) throws Exception {
    TSVReader reader = new TSVReader(0, 1, 2, 4, 5, 6, 7);
    reader.open(new FileInputStream(inputFile));
    DEPTree tree;
    while ((tree = reader.next()) !=...

POS of WSJ in CONLL format from penn tree bank

nlp

Check the perl script available from the task page: http://ilk.uvt.nl/team/sabine/homepage/software.html

How apache UIMA is different from Apache Opennlp

nlp,opennlp,uima

As I understand the question, you are asking for the differences between the feature sets of Apache UIMA and Apache OpenNLP. Their feature sets barely have anything in common as these two projects have very different aims. Apache UIMA is an open source implementation of the UIMA specification. The latter...

Python NLTK pos_tag not returning the correct part-of-speech tag

python,machine-learning,nlp,nltk,pos-tagger

In short: NLTK is not perfect. In fact, no model is perfect.

In long: Try using another tagger (see https://github.com/nltk/nltk/tree/develop/nltk/tag), e.g. HunPos, Stanford POS, or Senna. Using the default MaxEnt POS tagger from NLTK, i.e. nltk.pos_tag:

>>> from nltk import word_tokenize, pos_tag
>>> text = "The quick brown fox jumps over...

How can the NamedEntityTag be used as EntityMention in RelationMention in the RelationExtractor?

nlp,stanford-nlp

You can match the text from the EntityMention with the NamedEntityText value from the NamedEntityTag.

Python NLTK - Making a 'Dictionary' from a Corpus and Saving the Number Tags

python,nlp,nltk,corpus,tagged-corpus

Since you just want the POS of loose words, you don't need ngrams; you need a tagged corpus. Assuming your corpus is already tagged, you can do it like this.

>>> from nltk.corpus import brown
>>> wordcounts = nltk.ConditionalFreqDist(brown.tagged_words())
>>> wordcounts["set"].tabulate(10)
VBN  VB  NN VBD VBN-HL NN-HL
159  88  86...

Are word-vector orientations universal?

nlp,word2vec

As outlined in the comments, "orientation" is not a well-defined concept in this context. A traditional word vector space has one dimension for each term. In order for word vectors to be compatible, they will need to have the same term order. This is typically not the case between different...

How Can I Access the Brown Corpus in Java (aka outside of NLTK)

java,nlp,nltk,corpus,tagged-corpus

Here's a link to the download page for the Brown Corpus: http://www.nltk.org/nltk_data/ All the files are zip files. The data format is described in the Brown Corpus Wikipedia article. I dunno what else to say. From there things should be obvious. EDIT: if you want the original source data, I think there's...

NLP - Word Representations

machine-learning,nlp,artificial-intelligence

Your tests sound very reasonable — they are the usual evaluation tasks that are used in research papers to test the quality of word embeddings. In addition, the website www.wordvectors.org can give you a good idea of how your vectors measure up. It allows you to upload your embeddings, generates...

stemming words in python

python,nlp,stemming

For each suffix in the given list, check whether the word ends with it; if it does, remove the suffix, otherwise return the word unchanged.

suffixes = ['ing']

def stem(word):
    for suff in suffixes:
        if word.endswith(suff):
            return word[:-len(suff)]
    return word

print(stem('having'))
# hav

...

What exactly is the difference between AnalysisEngine and CAS Consumer?

nlp,uima

There is no difference between them in the current version. Historically, a CASConsumer would typically not modify the CAS, but only use the data existing in the CAS (previously added by an Analysis Engine) to aggregate it/prepare it for use in other systems, e.g., ingestion in databases. In the current...

Stanford CoreNLP wrong coreference resolution

nlp,stanford-nlp

Like many components in AI, the Stanford coreference system is only correct to a certain accuracy. In the case of coreference this accuracy is actually relatively low (~60 on standard benchmarks in a 0-100 range). To illustrate the difficulty of the problem, consider the following apparently similar sentence with a...

Stanford Entity Recognizer (caseless) in Python Nltk

python,nlp,nltk

The second parameter of NERTagger is the path to the stanford tagger jar file, not the path to the model. So, change it to stanford-ner.jar (and place it there, of course). Also it seems that you should choose english.conll.4class.caseless.distsim.crf.ser.gz (from stanford-corenlp-caseless-2015-04-20-models.jar) instead of english.conll.4class.distsim.crf.ser.gz Thus try the following: english_nertagger =...

Handling count of characters with diacritics in R

r,unicode,character-encoding,nlp,linguistics

Here is my solution. The idea is that phonetic alphabets can have a Unicode representation and then: Use the Unicode package; it provides the function Unicode_alphabetic_tokenizer that: Tokenization first replaces the elements of x by their Unicode character sequences. Then, the non-alphabetic characters (i.e., the ones which do not have...

Natural Language Search (user intent search)

nlp,search-engine,keyword,voice-recognition,naturallyspeaking

See the following example: "find me an android phone with 4 gb ram and at least 16 gb storage." First of all you need a list of words that you can directly extract from the input and insert into your search query. This is the easy part. "find me an android...

Create Dictionary from Penn Treebank Corpus sample from NLTK?

python,dictionary,nlp,nltk,corpus

Quick solution:

>>> from nltk.corpus import treebank
>>> from nltk import ConditionalFreqDist as cfd
>>> from itertools import chain
>>> treebank_tagged_words = list(chain(*list(chain(*[[tree.pos() for tree in treebank.parsed_sents(pf)] for pf in treebank.fileids()]))))
>>> wordcounts = cfd(treebank_tagged_words)
>>> treebank_tagged_words[0]
(u'Pierre', u'NNP')
>>> wordcounts[u'Pierre']
FreqDist({u'NNP': 1})
>>> treebank_tagged_words[100]
(u'asbestos', u'NN')...

How to interpret scikit's learn confusion matrix and classification report?

machine-learning,nlp,scikit-learn,svm,confusion-matrix

The classification report should be straightforward: it reports precision, recall, and F-measure for each class in your test data. In multiclass problems it is not a good idea to read precision/recall and F-measure over the whole data, because any imbalance would make you feel you've reached better results. That's where such reports help....