java,weka,text-classification,arff
You can use Weka's StringToWordVector filter to convert the text into a word vector (but not necessarily a sparse matrix). Take a look at my tutorial on this.
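For readers outside Weka, here is a minimal sketch of the same text-to-word-vector step using scikit-learn's CountVectorizer (my own substitution for Weka's Java API; the toy documents are assumptions):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the dog barked"]  # toy documents (assumed)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # sparse document-term matrix
print(sorted(vectorizer.vocabulary_))     # learned vocabulary
print(X.toarray())                        # dense view of the word-count vectors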
python,machine-learning,scikit-learn,text-classification
So, it's impossible to say without actually seeing the source code of setup_data, but I have a pretty decent guess as to what is going on here. sklearn follows the fit/transform pattern, meaning there are two stages: fit and transform. In the example of the CountVectorizer, the fit stage...
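To make the two stages concrete, here is a small sketch with CountVectorizer (the toy strings are my own):

from sklearn.feature_extraction.text import CountVectorizer

train = ["spam spam ham", "ham eggs"]  # toy training documents (assumed)
test = ["spam eggs unknown"]           # toy test document (assumed)

vec = CountVectorizer()
vec.fit(train)                  # fit: learn the vocabulary from the training data
X_train = vec.transform(train)  # transform: map documents onto that vocabulary
X_test = vec.transform(test)    # 'unknown' is dropped: it was not seen at fit time
print(X_test.toarray())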
machine-learning,weka,data-mining,rapidminer,text-classification
The simplest way is to break the dataset into binary problems. If, for example, you have the documents
text1: science
text2: sports, politics
break it into 3 datasets:
dataset1 (science): text1:true, text2:false
dataset2 (sports): text1:false, text2:true
dataset3 (politics): text1:false, text2:true
Create 3 binary classifiers, one for each class, use...
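A sketch of the same one-vs-rest decomposition in Python (the vectorizer and MultinomialNB choices are my own assumptions, not part of the original answer):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["text1 body here", "text2 body here"]  # placeholder documents
labels = [{"science"}, {"sports", "politics"}]  # label sets per document

X = CountVectorizer().fit_transform(texts)
classifiers = {}
for cls in ("science", "sports", "politics"):
    y = [cls in doc_labels for doc_labels in labels]  # binary target per class
    classifiers[cls] = MultinomialNB().fit(X, y)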
Since you are dealing with classification, it might be interesting for you to have a look at AlchemyAPI, http://www.alchemyapi.com/products/features/. They offer a free API key so you can try things out. But it doesn't stop there: if you want to do it manually, as you can see in @tripleee's answer, WordNet...
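For the manual WordNet route the answer points to, a minimal NLTK lookup looks like this (the query word is my own choice):

from nltk.corpus import wordnet  # requires: nltk.download('wordnet')

for synset in wordnet.synsets("oil"):
    print(synset.name(), synset.definition())
    print("  lemmas:", [lemma.name() for lemma in synset.lemmas()])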
machine-learning,nltk,stanford-nlp,text-classification,naivebayes
If you know how to train and predict texts (or sentences, in your case) using nltk's naive Bayes classifier with words as features, then you can easily extend this approach to classify texts by POS tags. This is because the classifier doesn't care whether your feature strings are words...
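A sketch of that extension, assuming the usual nltk feature-dict format (the toy sentences and labels are mine):

import nltk  # requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def pos_features(sentence):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))]
    return {tag: True for tag in tags}  # same bag-of-features shape as with words

train = [("The cat sat on the mat", "statement"),
         ("Sit down right now", "command")]
featuresets = [(pos_features(sent), label) for sent, label in train]
classifier = nltk.NaiveBayesClassifier.train(featuresets)
print(classifier.classify(pos_features("Close the door now")))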
python,classification,nltk,text-classification
Some background: the OP's purpose is to build a classifier for this project: https://github.com/alejandrovega44/CSCE-470-Anime-Recommender Firstly, there are several methodological issues, in terms of what you're calling things. Your training data should be the raw data you're using for your task, i.e. the json file at: https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/UserAnime2 And the data structure...
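A sketch of fetching that raw file; I haven't inspected its structure, so this only loads it and reports the top-level type:

import json
from urllib.request import urlopen

url = "https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/UserAnime2"
with urlopen(url) as response:
    data = json.loads(response.read().decode("utf-8"))  # assumes the file is valid JSON
print(type(data))  # inspect the structure before building features from it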
command-line,weka,text-classification
We could not get the batch filter to work with any resampling filter. However, our workaround was to simply resample (and then randomize) the training data as step 1. From this reduced set we ran batch filters for everything else we wanted on the test set. This seemed to work fine.
This is not such a trivial problem. In fact, it's a standard question in machine learning, so there is no quick and easy solution.
Below is a method that should help you solve your problem; you can edit it to reach your aim.

public void showPredictions() throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader("SparseDTM.arff"));
    Instances data = new Instances(reader);
    reader.close();
    data.setClassIndex(data.numAttributes() - 1);  // assume the class is the last attribute
    NaiveBayes classifier = new NaiveBayes();
    classifier.buildClassifier(data);
    Evaluation eval = new Evaluation(data);        // 'eval' was undefined in the original snippet
    double[] predictions = eval.evaluateModel(classifier, data);
    ...
python,machine-learning,scikit-learn,classification,text-classification
text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", ...) According to the documentation, that line loads the files' contents from C:/Users/USERNAME/projects/machine_learning/my_project/train into text_data.data. It will also load the target labels (represented by their integer indexes) for each document into text_data.target. So text_data.data should be a list of strings and text_data.target a list of integers. The labels...
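A sketch of what comes back, assuming each sub-folder of train/ is one class (the encoding argument is my assumption):

from sklearn.datasets import load_files

text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train",
                       encoding="utf-8")  # encoding choice is an assumption
print(text_data.data[:1])       # list of document strings
print(text_data.target[:5])     # integer label indexes, one per document
print(text_data.target_names)   # mapping from index to folder (class) name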
stop-words,lda,topic-modeling,text-classification
I am working on a similar problem, though for text categorization. From my experience, it is good to have a domain-specific stop word list along with the standard list. Otherwise, words like "introduction", "review", etc. will come up in the term frequency matrix, if you have...
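One way to combine a domain-specific stop list with the standard one, sketched in scikit-learn (the domain words here are examples of my own):

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

domain_stop = {"introduction", "review", "abstract", "conclusion"}
all_stop = list(ENGLISH_STOP_WORDS.union(domain_stop))  # standard + domain words
vectorizer = CountVectorizer(stop_words=all_stop)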
matlab,classification,text-classification
Here is an example of Naive Bayes classification:

x1 = 5 * rand(100,1);
y1 = 5 * rand(100,1);
data1 = [x1,y1];
x2 = -5 * rand(100,1);
y2 = 5 * rand(100,1);
data2 = [x2,y2];
x3 = -5 * rand(100,1);
y3 = -5 * rand(100,1);
data3 = [x3,y3];
traindata = [data1(1:50,:);data2(1:50,:);data3(1:50,:)];
...
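The same toy setup rendered in Python (my translation, not part of the original answer): three 2-D point clouds, Gaussian naive Bayes trained on half of each.

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)
data1 = rng.rand(100, 2) * [5, 5]    # cluster in the (+,+) quadrant
data2 = rng.rand(100, 2) * [-5, 5]   # cluster in the (-,+) quadrant
data3 = rng.rand(100, 2) * [-5, -5]  # cluster in the (-,-) quadrant

traindata = np.vstack([data1[:50], data2[:50], data3[:50]])
trainlabels = np.repeat([1, 2, 3], 50)
model = GaussianNB().fit(traindata, trainlabels)
print(model.predict([[4.0, 4.0]]))   # expect class 1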
scikit-learn,tf-idf,text-classification
If I understand you correctly, you can fit your TF-IDF model on all of your data and then call transform on the smaller tagged corpus:

vec = TfidfVectorizer()
model = vec.fit(alldata)
tagged_data_tfidf = vec.transform(tagged_data)
...
python,algorithm,nlp,text-classification,naivebayes
As Ed Cottrell commented, you need to consider what happens if you encounter a word that is not in the documents in a category. You can avoid multiplying by 0 by using Laplace smoothing. If you see a word in k out of n documents in a category, you assign...
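The Laplace-smoothed estimate the answer is building up to, as I understand it: seeing a word in k of n category documents gives (k + 1) / (n + 2) instead of k / n, so an unseen word (k = 0) no longer zeroes out the product.

def laplace_probability(k, n):
    # add-one smoothing over the two outcomes "present" / "absent"
    return (k + 1.0) / (n + 2.0)

print(laplace_probability(0, 10))   # ~0.083 rather than 0.0
print(laplace_probability(10, 10))  # ~0.917 rather than 1.0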
machine-learning,weka,sentiment-analysis,text-classification
The easiest way to accomplish these tasks is using a FilteredClassifier. This kind of classifier integrates a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever), and you will always keep the original training set (unprocessed text), applying...
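For readers outside Weka, the scikit-learn analogue of that filter-plus-classifier arrangement (my own substitution) is a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# CountVectorizer plays the StringToWordVector role: the raw text stays
# untouched, and the same transformation is re-applied at prediction time.
model = Pipeline([("to_word_vector", CountVectorizer()),
                  ("classifier", MultinomialNB())])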
I am attempting to answer the question using a different text classification task than spam classification. Say I have the following training data:

"The US government had imposed extra taxes on crude oil", petroleum
"The German manufacturers are observing different genes of Canola oil", non-petroleum

And the following test data:...
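A runnable sketch of that toy setup (the vectorizer and classifier choices are mine; the original answer may have used different tooling):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["The US government had imposed extra taxes on crude oil",
               "The German manufacturers are observing different genes of Canola oil"]
train_labels = ["petroleum", "non-petroleum"]

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
print(clf.predict(vec.transform(["Extra taxes on crude oil"])))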
java,weka,text-classification,categorization
It seems like you changed the code from the website you referenced in some crucial points, but not in a good way. I'll try to draft what you're trying to do and what mistakes I've found. What you (probably) wanted to do in extractFeature is to split each tweet into words...
I found the videos below quite helpful when I first got my hands on text classification using Weka. You might want to take a look.

Weka Tutorial 31: Document Classification 1 (Application)
Weka Tutorial 32: Document Classification 2 (Application)
WEKA Text Classification for First Time & Beginner Users

You might...
python,scikit-learn,gaussian,text-classification
You can do the following:

from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB

class DenseTransformer(TransformerMixin):
    # GaussianNB needs a dense matrix, so densify the sparse TF-IDF output
    def transform(self, X, y=None, **fit_params):
        return X.todense()
    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)
    def fit(self, X, y=None, **fit_params):
        return self

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('TFIDF', TfidfTransformer()),
    ('to_dense', DenseTransformer()),
    ('clf', OneVsRestClassifier(GaussianNB()))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

Now,...
python,machine-learning,classification,scikit-learn,text-classification
You have encountered one of the problems with classification under a highly imbalanced class distribution. I have to disagree with those who state the problem is with the Naive Bayes method, and I'll provide an explanation which should hopefully illustrate what the problem is. Imagine your false positive rate is...
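Illustrative numbers (my own, not from the answer) showing why even a low false positive rate swamps a rare positive class:

# 1% positives; classifier with 90% recall and a 5% false positive rate
positives, negatives = 1000, 99000
true_pos = 0.90 * positives    # 900 correctly flagged
false_pos = 0.05 * negatives   # 4950 wrongly flagged
precision = true_pos / (true_pos + false_pos)
print(precision)               # ~0.15: most flagged items are actually negatives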
python,email,pandas,scikit-learn,text-classification
A simple solution forces the fit data to be constant irrespective of the cross-validation data:

X_all = ...  # the full dataset, defined before cross-validation

class MyVectorizer(sklearn.feature_extraction.text.TfidfVectorizer):
    def fit(self, X, y=None):
        return super(MyVectorizer, self).fit(X_all)
    def fit_transform(self, X, y=None):
        return super(MyVectorizer, self).fit(X_all).transform(X)

Use this in place of the 'words' sub-pipeline above. An arguably less hacky, but...
python,mongodb,document,text-classification
I think collection.find() returns documents based on _id. In MongoDB, an ObjectId is based on time, so the results should come back first in, first out (insertion order). So the sorting is effectively based on time. In your set, there should be no overlap...
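If you need a guaranteed order rather than relying on the default, sort on _id explicitly. A pymongo sketch (the database and collection names are placeholders):

import pymongo

client = pymongo.MongoClient()               # default localhost connection
collection = client["mydb"]["mycollection"]  # placeholder names

# oldest first (insertion order); use pymongo.DESCENDING for newest first
for doc in collection.find().sort("_id", pymongo.ASCENDING):
    print(doc["_id"].generation_time)        # an ObjectId embeds its creation time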
python,memory,numpy,scikit-learn,text-classification
The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of...
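Two ways to cut the feature count drastically, sketched with parameter values of my own choosing: cap the vocabulary, then keep only the most class-informative terms.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# placeholder corpus; substitute your real documents and labels
documents = ["cheap pills offer", "meeting at noon",
             "cheap offer now", "project meeting notes"]
labels = [1, 0, 1, 0]

vec = CountVectorizer(max_features=10000, stop_words="english")
X = vec.fit_transform(documents)
X_reduced = SelectKBest(chi2, k=min(150, X.shape[1])).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)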
machine-learning,classification,text-mining,text-classification,document-classification
I think Naive Bayes can give you insights. Another way is to find features which separate such books, for example:
1. Complexity of words: some writers are easy to understand and use common words; I am hinting at IDF (inverse document frequency) here (see the sketch below).
2. Some words may...
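A sketch of point 1, using IDF as a proxy for word rarity/complexity (TfidfVectorizer is my choice of tooling; the toy texts are mine):

from sklearn.feature_extraction.text import TfidfVectorizer

books = ["the plain simple story of a dog",
         "an abstruse perspicacious disquisition"]  # toy "books"
vec = TfidfVectorizer()
vec.fit(books)
for word, idx in sorted(vec.vocabulary_.items()):
    print(word, vec.idf_[idx])   # rarer words get higher IDF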