Generate an Arff File for Weka

java,weka,text-classification,arff

You can use Weka's StringToWordVector filter to convert the text into a word vector (but not necessarily a sparse matrix). Take a look at my tutorial on this.

CountVectorizer deleting features that only appear once

python,machine-learning,scikit-learn,text-classification

So, it's impossible to say without actually seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit_transform format, meaning there are two stages, specifically fit, and transform. In the example of the CountVectorizer the fit stage...
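The two stages can be illustrated with a small sketch. The example documents below are made up, and this is only an illustration of fit versus transform, not a reconstruction of the asker's setup_data function: the vocabulary is learned during fit, and words that first appear at transform time are silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora, for illustration only.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat meowed"]

vectorizer = CountVectorizer()

# Stage 1: fit learns the vocabulary from the training documents only.
vectorizer.fit(train_docs)
vocab = sorted(vectorizer.vocabulary_)

# Stage 2: transform counts only words that are already in that vocabulary,
# so "meowed" (never seen during fit) produces no feature at all.
test_counts = vectorizer.transform(test_docs).toarray()[0].tolist()

print(vocab)
print(test_counts)
```

If the vectorizer is fitted on one set of documents and used to transform another, any word that occurs only in the second set disappears, which can look like features being deleted.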

Converting Multilabel dataset into Single Label?

machine-learning,weka,data-mining,rapidminer,text-classification

The simplest way is to break the dataset into binary problems. For example, if you have the dataset text1: science; text2: sports, politics, break it into 3 datasets: dataset1 (science): text1:true, text2:false; dataset2 (sports): text1:false, text2:true; dataset3 (politics): text1:false, text2:true. Create 3 binary classifiers, one for each class, use...
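The split into one binary dataset per label can be sketched in a few lines. The function name to_binary_datasets and the dictionary layout are assumptions made for this illustration; they do not come from Weka or RapidMiner.

```python
def to_binary_datasets(labelled_texts):
    """Build one true/false dataset per label from a multilabel dataset.

    labelled_texts maps a text id to the set of labels assigned to it.
    """
    all_labels = sorted(
        {label for labels in labelled_texts.values() for label in labels}
    )
    return {
        label: {
            text_id: (label in labels)
            for text_id, labels in labelled_texts.items()
        }
        for label in all_labels
    }


# The two example texts from the answer above.
multilabel = {"text1": {"science"}, "text2": {"sports", "politics"}}
binary = to_binary_datasets(multilabel)

for label, dataset in binary.items():
    print(label, dataset)
```

Each resulting dataset can then be handed to its own binary classifier.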

How do I tell if a noun is a person, place, or a thing?

nlp,text-classification

Since you are dealing with classification, it might be interesting for you to have a look at AlchemyAPI, http://www.alchemyapi.com/products/features/. You can get a free API key to try things out. But it doesn't stop there: if you want to do it manually, as you can see in @tripleee's answer, WordNet...

How to train a naive bayes classifier with pos-tag sequence as a feature?

machine-learning,nltk,stanford-nlp,text-classification,naivebayes

If you know how to train and predict texts (or sentences in your case) using nltk's naive bayes classifier and words as features, then you can easily extend this approach in order to classify texts by pos-tags. This is because the classifier doesn't care about whether your feature-strings are words...

trouble with nltk python NaiveBayesClassifier, I keep getting same probabilities inputs correct?

python,classification,nltk,text-classification

Some background: the OP's goal is to build a classifier for this project: https://github.com/alejandrovega44/CSCE-470-Anime-Recommender Firstly, there are several methodological issues in terms of what you're calling things. Your training data should be the raw data you're using for your task, i.e. the json file at: https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/UserAnime2 And the data structure...

Batch Filtering with Multi-Filter throws a 'Class attribute not set' exception

command-line,weka,text-classification

Could not get the batch filter to work with any resampling filter. However, our workaround was to simply resample (and then randomize) the training data as step 1. From this reduced set we ran batch filters for everything else we wanted on the test set. This seemed to work fine.

classify keywords into fields

keyword,text-classification

This is not a trivial problem. In fact, it's a standard question in Machine Learning, so there isn't a quick and dirty solution.

How to identify the exact instances that are wrongly classified in Weka

weka,text-classification

Below is a method that will help you solve your problem. You can edit it to suit your needs. public void showPredictions( ){ BufferedReader reader=null; reader= new BufferedReader(new FileReader("SparseDTM.arff")); Instances data = new Instances(reader); double[] predictions; try { NaiveBayes classifier = new NaiveBayes(); classifier.buildClassifier(data); predictions = eval.evaluateModel(classifier, data...

I'm not sure how to interpret accuracy of this classification with Scikit Learn

python,machine-learning,scikit-learn,classification,text-classification

text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", ...) According to the documentation, that line loads your file's contents from C:/Users/USERNAME/projects/machine_learning/my_project/train into text_data.data. It will also load target labels (represented by their integer indexes) for each document into text_data.target. So text_data.data should be a list of strings and text_data.target a list of integers. The labels...

Using Topic Model, how should we set up a “stop words” list?

stop-words,lda,topic-modeling,text-classification

I am working on a similar problem, but for text categorization. From my experience, it is good to have a domain-specific stop word list along with the standard list. Otherwise, words like "introduction", "review" etc. will come up in the term frequency matrix, if you have...
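Combining the two lists might look like the sketch below. The standard list here is a tiny stand-in rather than a real published stop word list, and the tokenizer is deliberately naive; both are assumptions for illustration.

```python
import re

# Tiny stand-in for a standard stop word list (assumption, not a real list).
standard_stop_words = {"the", "a", "of", "and", "is", "in"}

# Domain-specific additions, using the two words mentioned above.
domain_stop_words = {"introduction", "review"}

stop_words = standard_stop_words | domain_stop_words


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stop_words]


tokens = remove_stop_words(
    "Introduction: a review of topic models in text mining", stop_words
)
print(tokens)
```

Keeping the domain list separate from the standard one makes it easy to reuse the standard list on other corpora.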

Implementation of text classification in MATLAB with naive bayes

matlab,classification,text-classification

Here is an example of Naive Bayes classification, x1 = 5 * rand(100,1); y1 = 5 * rand(100,1); data1 = [x1,y1]; x2 = -5 * rand(100,1); y2 = 5 * rand(100,1); data2 = [x2,y2]; x3 = -5 * rand(100,1); y3 = -5 * rand(100,1); data3 = [x3,y3]; traindata = [data1(1:50,:);data2(1:50,:);data3(1:50,:)];...

With TfidfVectorizer, is it possible to use one corpus for idf information, and another one for the actual index?

scikit-learn,tf-idf,text-classification

If I understand you correctly, you can fit your TFIDF model to all of your data and then call transform on the smaller tagged corpus: vec = TfidfVectorizer() model = vec.fit(alldata) tagged_data_tfidf = vec.transform(tagged_data) ...

Implementing Naive Bayes text categorization but I keep getting zeros

python,algorithm,nlp,text-classification,naivebayes

As Ed Cottrell commented, you need to consider what happens if you encounter a word that is not in the documents in a category. You can avoid multiplying by 0 by using Laplace smoothing. If you see a word in k out of n documents in a category, you assign...
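A minimal sketch of the idea follows. The exact formula in the answer above is cut off, so the add-one form used here, (k + 1) / (n + 2), is an assumption: it is one common variant for "word present or absent in a document". Summing logarithms instead of multiplying probabilities is also an addition of this sketch, a common way to avoid underflow. All counts are invented.

```python
import math


def smoothed_word_prob(k, n, alpha=1.0):
    """Estimate P(word appears in a document | category).

    k is the number of documents in the category containing the word and
    n is the number of documents in the category. Add-alpha smoothing over
    the two outcomes (present, absent) keeps the estimate above zero.
    """
    return (k + alpha) / (n + 2 * alpha)


def log_score(words, doc_counts, n_docs, prior):
    """Sum log-probabilities so that no single word can zero out the score."""
    score = math.log(prior)
    for word in words:
        k = doc_counts.get(word, 0)  # unseen words get k = 0
        score += math.log(smoothed_word_prob(k, n_docs))
    return score


# Hypothetical category with 10 documents; "quantum" never occurs in it.
sports_counts = {"goal": 8, "match": 5}
score = log_score(["goal", "quantum"], sports_counts, n_docs=10, prior=0.5)
print(score)
```

Without smoothing, the unseen word would contribute a factor of 0 and wipe out the evidence from every other word.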

How to output resultant documents from Weka text-classification

machine-learning,weka,sentiment-analysis,text-classification

The easiest way to accomplish these tasks is using a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever), and you will always keep the original training set (unprocessed text), and applying...

How can I use my text classifier in practice? As of getting the tf-idf values of new comments

weka,text-classification

I am attempting to answer the question using a different text classification task than spam classification. Say, I have the following training data: "The US government had imposed extra taxes on crude oil", petrolium "The German manufacturers are observing different genes of Canola oil", non-petrolium And the following test data:...

text classifier with weka: how to correctly train a classifier issue

java,weka,text-classification,categorization

It seems like you changed the code from the website you referenced in some crucial points, but not in a good way. I'll try to draft what you're trying to do and what mistakes I've found. What you (probably) wanted to do in extractFeature is Split each tweet into words...

Weka, Text Classification on an arff file

weka,text-classification

I found the videos below quite helpful when I first got my hands on text classification using Weka. You might want to take a look. Weka Tutorial 31: Document Classification 1 (Application) Weka Tutorial 32: Document classification 2 (Application) WEKA Text Classification for First Time & Beginner Users You might...

Scikit-learn's Pipeline: Error with multilabel classification. A sparse matrix was passed

python,scikit-learn,gaussian,text-classification

You can do the following: class DenseTransformer(TransformerMixin): def transform(self, X, y=None, **fit_params): return X.todense() def fit_transform(self, X, y=None, **fit_params): self.fit(X, y, **fit_params) return self.transform(X) def fit(self, X, y=None, **fit_params): return self classifier = Pipeline([ ('vectorizer', CountVectorizer ()), ('TFIDF', TfidfTransformer ()), ('to_dense', DenseTransformer()), ('clf', OneVsRestClassifier (GaussianNB()))]) classifier.fit(X_train,Y) predicted = classifier.predict(X_test) Now,...

Naive Bayes: Imbalanced Test Dataset

python,machine-learning,classification,scikit-learn,text-classification

You have encountered one of the problems with classification with a highly imbalanced class distribution. I have to disagree with those that state the problem is with the Naive Bayes method, and I'll provide an explanation which should hopefully illustrate what the problem is. Imagine your false positive rate is...
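The effect of class imbalance can be checked with a little arithmetic. The rates below are hypothetical and are not the figures from the answer above, which is cut off. The sketch computes precision, the fraction of positive predictions that are correct, for the same classifier on a balanced and on an imbalanced test set.

```python
def precision(prevalence, true_positive_rate, false_positive_rate):
    """Fraction of positive predictions that are actually positive.

    prevalence is the share of truly positive examples in the test set.
    """
    true_positives = prevalence * true_positive_rate
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# Same hypothetical classifier (90% recall, 5% false positive rate),
# evaluated on two different class distributions.
balanced = precision(prevalence=0.5, true_positive_rate=0.9,
                     false_positive_rate=0.05)
imbalanced = precision(prevalence=0.01, true_positive_rate=0.9,
                       false_positive_rate=0.05)

print(round(balanced, 3), round(imbalanced, 3))
```

With these assumed rates, precision is roughly 0.95 on the balanced set and roughly 0.15 on the imbalanced one, even though the classifier itself has not changed.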

Parameters grid search with different text sets for dictionary creation and cross validation

python,email,pandas,scikit-learn,text-classification

A simple solution forces the fit data to be constant irrespective of the cross-validation data: X_all = full dataset class MyVectorizer(sklearn.feature_extraction.text.TfidfVectorizer): def fit(self, X, y=None): return super(MyVectorizer, self).fit(X_all) def fit_transform(self, X, y=None): return super(MyVectorizer, self).fit(X_all).transform(X) Use this in place of the 'words' sub-pipeline above. An arguably less hacky, but...

In what order does .find() return MongoDB documents? [duplicate]

python,mongodb,document,text-classification

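I think collection.find() returns documents ordered by _id. In MongoDB an ObjectId starts with a creation timestamp, so that order would follow insertion time, oldest first. Keep in mind that MongoDB does not guarantee any particular order unless you sort explicitly. In your set, there should be no overlap...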

how can I complete the text classification task using less memory

python,memory,numpy,scikit-learn,text-classification

The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of...

Authorship Attribution using Machine Learning [closed]

machine-learning,classification,text-mining,text-classification,document-classification

I think Naive Bayes can give you insights. Another approach is to find features that separate such books, e.g. 1. Complexity of words: some writers are easy to understand and use common words; I am hinting at IDF (inverse document frequency). 2. Some words may...