Enforcing that inputs sum to 1 and are contained in the unit interval in scikit-learn

python,numpy,encoding,machine-learning,scikit-learn

If I understand your question correctly, then it is simply quadratic programming - all the constraints you mentioned (both equalities and inequalities) are linear.
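
As a rough sketch of that idea, assuming a least-squares objective (the question's actual objective is not shown here), the sum-to-one and unit-interval constraints can be passed to a general constrained solver. SciPy's SLSQP method stands in for a dedicated QP solver, and A and b are made-up data:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: find weights w minimising ||A w - b||^2
rng = np.random.RandomState(0)
A = rng.rand(20, 4)
b = rng.rand(20)


def objective(w):
    residual = A @ w - b
    return residual @ residual


n = A.shape[1]
result = minimize(
    objective,
    x0=np.full(n, 1.0 / n),            # feasible starting point
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n,            # linear inequalities: each weight in [0, 1]
    constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # linear equality
)
w = result.x
print(w, w.sum())
```

Both constraint types are linear in w, which is what makes this a quadratic program when the objective is quadratic.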

What's the most pythonic way to load a matrix in ijv/coo/triplet format?

python,pandas,scipy,scikit-learn

You need to define an unambiguous mapping from the row/column names to some indices (it is not important whether "Apple" is "0", or "1", just that it is represented by a number, hence this won't exactly match your result, but it should not matter). In this example, 'info.txt' contains Apple,Google,1...
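
A minimal sketch of such a mapping, assuming each line holds "row_name,col_name,value". The in-memory text below is a made-up stand-in for the file, and only the first triplet comes from the example above:

```python
import io
from scipy.sparse import coo_matrix

# Hypothetical stand-in for a triplet file with lines "row_name,col_name,value"
triplets = io.StringIO("Apple,Google,1\nApple,Samsung,5\nGoogle,Apple,2\n")

row_ids, col_ids = {}, {}   # name -> integer index, assigned on first appearance
rows, cols, vals = [], [], []
for line in triplets:
    r, c, v = line.strip().split(",")
    rows.append(row_ids.setdefault(r, len(row_ids)))
    cols.append(col_ids.setdefault(c, len(col_ids)))
    vals.append(float(v))

matrix = coo_matrix((vals, (rows, cols)), shape=(len(row_ids), len(col_ids)))
dense = matrix.toarray()
print(dense)
```

The dictionaries row_ids and col_ids let you translate indices back to names afterwards.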

NaiveBayes classifier handling different data types in python

python,scikit-learn,gaussian,naivebayes

Yes, you will need to convert the strings to numerical values. The naive Bayes classifier cannot handle strings, as there is no way a string can enter into a mathematical equation. If your strings have some "scalar value", for example "large, medium, small", you might want to classify...

Output the subset of instances used to train each base_estimator of a BaggingClassifier

python,pandas,machine-learning,scikit-learn

It looks like I have answered my own question. The estimators_samples_ attribute of BaggingClassifier is what I want.
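
A small sketch of reading that attribute, assuming it is taken from a fitted BaggingClassifier with its default decision-tree base estimator. The dataset is synthetic, and depending on the scikit-learn version each entry may be an array of row indices or a boolean mask:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
bag = BaggingClassifier(n_estimators=3, random_state=0).fit(X, y)

# One entry per base estimator, describing which training rows it was fitted on
samples_per_estimator = bag.estimators_samples_
for i, samples in enumerate(samples_per_estimator):
    print(i, len(samples))
```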

One Hot Encoding for representing corpus sentences in python

python,machine-learning,nlp,scikit-learn,one-hot

In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix. Example code for two simple...

How to get the number of components needed in PCA with all extreme variance?

scikit-learn,pca

I am not using Python, but I did something similar to what you need in C++ and OpenCV. I hope you succeed in converting it to your language of choice. // choose how many eigenvectors you want: int nEigensOfInterest = 0; float sum = 0.0; for (int i = 0; i < mEiVal.rows; ++i) { sum...

Efficient element-wise function computation in Python

python,numpy,scikit-learn,vectorization

np.vectorize does make some improvement in speed - about 2x (here I'm using math.atan2 as a black-box function that takes 2 scalar arguments). In [486]: X=np.linspace(0,1,100) In [487]: K_vect=np.vectorize(math.atan2) In [488]: timeit proxy_kernel(X,X,math.atan2) 100 loops, best of 3: 7.84 ms per loop In [489]: timeit K_vect(X[:,None],X) 100 loops, best...

RandomForestClassifier.fit(): ValueError: could not convert string to float

python,scikit-learn,random-forest

You have to do some encoding before using fit(). As noted, fit() does not accept strings, but you can work around this. There are several classes that can be used: LabelEncoder: turns your strings into incremental integer values. OneHotEncoder: uses the one-of-K scheme to transform your strings into integer...

Provide Starting Positions to t-distributed Stochastic Neighbor Embedding (TSNE) in scikit-learn

python,scikit-learn

It is currently not possible, but it would be a two-line change. I think it would be a good addition, and we do support init=array for things like k-means. So PR welcome.

How to make a binary decomposition of pandas.Series column

python,pandas,machine-learning,scikit-learn

I don't think there is a way to do this automatically: In [11]: res = df.pop("A").str.get_dummies() # Note: pop removes column A from df In [12]: res.columns = res.columns.map(lambda x: "A_" + x) In [13]: res Out[13]: A_a A_b A_c 0 1 0 0 1 0 1 0 2 1...

How does sklearn.cluster.KMeans handle an init ndarray parameter with missing centroids (available centroids less than n_clusters)?

python,scikit-learn,k-means

If you give it a mismatching init it will adjust the number of clusters, as you can see from the source. This is not documented and I would consider it a bug. I'll propose to fix it.

How to get Best Estimator on GridSearchCV (Random Forest Classifier Scikit)

python,scikit-learn,random-forest,cross-validation

You have to fit your data before you can get the best parameter combination. from sklearn.grid_search import GridSearchCV from sklearn.datasets import make_classification from sklearn.ensemble import RandomForestClassifier # Build a classification task using 3 informative features X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=0, n_repeated=0, n_classes=2, random_state=0, shuffle=False) rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt'...

TypeError: get_params() missing 1 required positional argument: 'self'

python,scikit-learn

This error is almost always misleading, and actually means that you're calling an instance method on the class, rather than the instance (like calling dict.keys() instead of d.keys() on a dict named d).* And that's exactly what's going on here. The docs imply that the best_estimator_ attribute, like the estimator...

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

scikit-learn

It looks like sklearn requires the data to have shape (number of rows, number of columns). If your data has a one-dimensional shape such as (999,), it does not work. Use numpy.reshape to change it to (999, 1), e.g. data.reshape((999, 1)). In my case, that worked.
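
For illustration, a short sketch of the reshape on a made-up array of 999 values:

```python
import numpy as np

data = np.arange(999, dtype=float)   # shape (999,), a 1-D array
column = data.reshape(-1, 1)         # shape (999, 1); -1 lets NumPy infer the row count
print(data.shape, column.shape)
```

Writing -1 for the row count avoids hard-coding the number of samples.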

Find the tf-idf score of specific words in documents using sklearn

python,scikit-learn,tf-idf

Yes. See .vocabulary_ on your fitted/transformed TF-IDF vectorizer. In [1]: from sklearn.datasets import fetch_20newsgroups In [2]: data = fetch_20newsgroups(categories=['rec.autos']) In [3]: from sklearn.feature_extraction.text import TfidfVectorizer In [4]: cv = TfidfVectorizer() In [5]: X = cv.fit_transform(data.data) In [6]: cv.vocabulary_ It is a dictionary of the form: {word : column index in...

Skimage Python33 Canny

python-3.x,numpy,scikit-learn,python-3.3

I suspect image1.jpg is a color image, so im is 3D, with shape (num_rows, num_cols, num_color_channels). One option is to tell imread to flatten the image into a 2D array by giving it the argument flatten=True: im = misc.imread('image1.jpg', flatten=True) Or you could apply canny to just one of the...

Recover named features from L1 regularized logistic regression

python,machine-learning,scikit-learn

features = pipeline.named_steps['tfidf'].get_feature_names() print(features[pipeline.named_steps['l1'].coef_ != 0]) See TfidfTransformer docs, LogisticRegression docs and the unmerged improved pipeline docs here...

SciKit-learn for data driven regression of oscillating data

python,time-series,scikit-learn,regression,prediction

Here is my guess about what is happening in your two types of results: .days does not convert your index into a form that repeats itself between your train and test samples. So it becomes a unique value for every date in your dataset. As a consequence your models either...

scikit-learn pipeline

python,scikit-learn,pipeline,feature-selection

Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate two kinds of features, and the transformer in each needs to select the features and transform them. One way to do that is to make a pipeline of a transformer that selects...

sklearn.cross_validation.StratifiedShuffleSplit - error: “indices are out-of-bounds”

python,pandas,scikit-learn

You're running into the different conventions for Pandas DataFrame indexing versus NumPy ndarray indexing. The arrays train_index and test_index are collections of row indices. But data is a Pandas DataFrame object, and when you use a single index into that object, as in data[train_index], Pandas is expecting train_index to contain...

Scikit learn cross validation split

python,scikit-learn,classification,cross-validation

In the documentation for this function, it states for the cv arg: cv : cross-validation generator or int, optional, default: None A cross-validation generator to use. If int, determines the number of folds in StratifiedKFold if y is binary or multiclass and estimator is a classifier, or the number of...

Using Cross-Validation on a Scikit-Learn Classifier

python,scikit-learn,cross-validation

For what you're describing, you just need to use train_test_split with a following split on its results. Adapting the tutorial there, start with something like this: import numpy as np from sklearn import cross_validation from sklearn import datasets from sklearn import svm iris = datasets.load_iris() iris.data.shape, iris.target.shape ((150, 4), (150,))...

Differences in sklearn's RidgeCV options

scikit-learn

Several things are happening here, I believe. There is correlation structure in the data, and cv=10 doesn't shuffle. If you shuffle the data before using boston, target = shuffle(boston, target, random_state=3) It seems very dependent on the random state, though. I don't know why the eigen mode is so much...

'verbose' argument in scikit-learn

python,arguments,scikit-learn,verbosity,verbose

Higher integers map to higher verbosity, as the docstring says. You can set verbose=100, but I'm pretty sure it will be the same as verbose=10. If you are looking for a list of what exactly is printed for each estimator for each integer, you have to look into the source....

Clustering cosine similarity matrix

python,math,scikit-learn,cluster-analysis,data-mining

You can easily do this using spectral clustering. You can use the ready implementations such as the one in sklearn or implement it yourself. It is a rather easy algorithm. Here is a piece of code doing it in python using sklearn: import numpy as np from sklearn.cluster import...

scikit learn documentation in PDF

python-2.7,pdf,scikit-learn,html-to-pdf

User docs can be found here http://sourceforge.net/projects/scikit-learn/files/documentation/ also, the older versions. Also, you can create up-to-date offline HTML docs from Github https://github.com/scikit-learn/scikit-learn/tree/master/doc To generate the full web page, including the example gallery (this might take a while): make html Or, if you'd rather not build the example gallery: make html-noplot...

Python Scikit Decision Tree with variable number of outputs

python,scikit-learn,decision-tree

It sounds like you are trying to do multi-label classification, not multi-output classification. Multi-label can be most easily done by providing an indicator vector that says for each sample and each class whether they are in the class or not, so you get a binary array (0 for not in...

Classification with scikit-learn KNN using multi-dimensional features (input dimension error)

python,machine-learning,scikit-learn

Sounds like you want to use features which are themselves multi-dimensional. I'm not sure this works. Consider the increase in complexity that would occur for a distance-based metric like KNN; multi-dimensional features would require distance metrics and would get a lot more involved. I'd first try just flattening the arrays,...

Problems obtaining most informative features with scikit learn?

python,pandas,machine-learning,nlp,scikit-learn

To solve this specifically for linear SVM, we first have to understand the formulation of the SVM in sklearn and the differences that it has to MultinomialNB. The reason why the most_informative_feature_for_class works for MultinomialNB is because the output of the coef_ is essentially the log probability of features given...

How does the class_weight parameter in scikit-learn work?

python,scikit-learn

First off, it might not be good to just go by recall alone. You can simply achieve a recall of 100% by classifying everything as the positive class. I usually suggest using AUC for selecting parameters, and then finding a threshold for the operating point (say a given precision level)...

Can one train estimators in a scikit-learn pipeline simultaneously?

python,machine-learning,scipy,scikit-learn,pipeline

You can write your own transformer that'll transform input into predictions. Something like this: class PredictionTransformer(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin): def __init__(self, estimator): self.estimator = estimator def fit(self, X, y): self.estimator.fit(X, y) return self def transform(self, X): return self.estimator.predict_proba(X) Then you can use FeatureUnion to glue your transformers together. That said, there's a...
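
Here is one way the pieces might fit together. It is only a sketch: the two classifiers and the synthetic dataset are arbitrary choices for illustration, not something the question specifies.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import FeatureUnion


class PredictionTransformer(BaseEstimator, TransformerMixin):
    """Expose a classifier's predicted probabilities as features."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        return self.estimator.predict_proba(X)


X, y = make_classification(n_samples=200, n_features=6, random_state=0)
union = FeatureUnion([
    ("logreg", PredictionTransformer(LogisticRegression())),
    ("nb", PredictionTransformer(GaussianNB())),
])
stacked = union.fit_transform(X, y)   # two probability columns per classifier
print(stacked.shape)
```

The stacked matrix can then be fed to a final estimator, for example as the last step of a Pipeline.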

Scikit learn, fitting a gaussian to a histogram

python,scikit-learn

Perhaps you are confusing the concept of optimising a statistical model from a set of data points and fitting a curve through a set of data points. Some of the scikit-learn code that is cited above, is trying to optimise a statistical model from a set of data points. In...

How to use meshgrid with large arrays in Matplotlib?

python,arrays,matplotlib,scipy,scikit-learn

As @wflynny says above, you need to give np.meshgrid two 1D arrays. We can use X.shape to create your x and y arrays, like this: X=np.zeros((100,85)) # just to get the right shape here print X.shape # (100, 85) x=np.arange(X.shape[0]) y=np.arange(X.shape[1]) print x.shape # (100,) print y.shape # (85,) xx,yy=np.meshgrid(x,y,indexing='ij')...

How to change the function a random forest uses to make decisions from individual trees?

scikit-learn,classification,random-forest,ensemble-learning

You can access the individual decision trees in the estimators_ attribute of a fitted random forest instance. You can even re-sample that attribute (it's just a Python list of decision tree objects) to add or remove trees and see the impact on the quality of the prediction of the resulting...

Differences between each F1-score values in sklearn.metrics.classification_report and sklearn.metrics.f1_score with a binary confusion matrix

python,scikit-learn,confusion-matrix

I think that 0.695652 is the same as 0.70 after rounding. The scikit-learn f1_score documentation explains that, by default, the F1 score is reported for the positive class in binary classification. You can also easily arrive at the score of 0.86 from the formula for the F1 score. The formulation of F1 score...
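
To make the rounding concrete, here is a small worked example of the standard F1 formula. The confusion-matrix counts are hypothetical, chosen only because they give F1 = 16/23; they are not taken from the question.

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the positive class
f1 = f1_score_from_counts(tp=8, fp=3, fn=4)
print(f"{f1:.6f}")   # six decimals, as f1_score would show
print(f"{f1:.2f}")   # two decimals, as classification_report would show
```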

Does scikit-learn perform “real” multivariate regression (multiple dependent variables)?

python,machine-learning,scikit-learn,linear-regression,multivariate-testing

This is a mathematical/stats question, but I will try to answer it here anyway. The outcome you see is absolutely expected. A linear model like this won't take correlation between dependent variables into account. If you had only one dependent variable, your model would essentially consist of a weight vector...

Scikit Learn Logistic Regression confusion

python-3.x,scikit-learn,logistic-regression

The problem you are facing is related to the fact that scikit learn is using regularized logistic regression. The regularization term allows for controlling the trade-off between the fit to the data and generalization to future unknown data. The parameter C is used to control the regularization, in your case:...

Multi-Class Classification in WEKA

machine-learning,scikit-learn,classification,weka,libsvm

You can look at RandomForest, which is a well-known and quite efficient classifier. In scikit-learn, some classes, such as RandomForestClassifier, can run on several cores. It has a constructor parameter that can be used to define the number of cores, or a value that will use...

Can't import GMM function from scikits.learn

scikit-learn

That link is very old; the module was renamed to sklearn. As you have installed version 0.16.1, you should be using from sklearn.mixture import GMM, as per the docs...

Controlling the posterior probability threshold for LDA and QDA in scikit-learn

scikit-learn

You can change the decision threshold by using the lda.predict_proba and then thresholding the probability manually: lda = LDA().fit(X_train, y_train) probs_positive_class = lda.predict_proba(X_test)[:, 1] # say default is the positive class and we want to make few false positives prediction = probs_positive_class > .9 This will give you a very...

How to handle data with names in scikit-learn?

python,python-2.7,scikit-learn

It is currently not possible to attach names or properties to rows in scikit-learn. This will change soon (https://github.com/scikit-learn/scikit-learn/issues/4497). But for now, it is really easy to keep track of this yourself. The order of the data points is the same as the order of the cluster labels you get...

Could I re-initialize the sklearn library

python,scikit-learn

You most likely have an older version of scikit-learn. You can check the current version using: python -c "import sklearn as sk; print sk.__version__" If you're using 0.16.1, you should be able to import LSHForest. ...

scikit-learn: Random forest class_weight and sample_weight parameters

python,scikit-learn

RandomForests are built on Trees, which are very well documented. Check how Trees use the sample weighting: User guide on decision trees - tells exactly what algorithm is used Decision tree API - explains how sample_weight is used by trees (which for random forests, as you have determined, is the...

using OneHotEncoder with sklearn_pandas DataFrameMapper

scikit-learn

The official version of sklearn-pandas has some problems when dealing with one-dimensional arrays and transformations. Try the following fork: https://github.com/dukebody/sklearn-pandas However, I think you can accomplish what you want using LabelBinarizer (as in the sklearn_pandas examples) instead of OneHotEncoder....

python sklearn cross_validation /number of labels does not match number of samples

python,scikit-learn,cross-validation

Your variables don't appear to match the return pattern for train_test_split. Try: features_train, features_test, labels_train, labels_test = ... ...
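
A brief sketch of the return order on made-up arrays. Note the import path is an assumption about your version: recent scikit-learn releases expose train_test_split in sklearn.model_selection, whereas releases from the time of this answer had it in sklearn.cross_validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)   # hypothetical data: 10 samples, 2 features
labels = np.arange(10)

# Return order: train/test for the first array, then train/test for the second
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.3, random_state=0
)
print(features_train.shape, labels_train.shape)
```

Unpacking in any other order silently pairs features with the wrong labels, which is what produces the mismatch in sample counts.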

Only ignore stop words for ngram_range=1

python,nlp,scikit-learn

You have at least 2 options: combine 2 kinds of features with FeatureUnion: one for ngram_range of (1,1) with stop words and one for ngram_range of (2,3) without stop words (more efficient, but harder to implement and use) implement your own analyzer that will check for presence in stop word...

Memory issue sklearn pairwise_distances calculation

python,out-of-memory,fork,scikit-learn,cosine-similarity

What version of scikit-learn are you using? And does it run with n_jobs=1? The result should fit in memory, it is 8 * 42588 ** 2 / 1024 ** 3 ≈ 13.5 GB. But the data is about 2 GB, and will be replicated to each core. So if you have...
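
The estimate above is just the size of a dense 42588 x 42588 matrix of 8-byte floats, which can be checked directly:

```python
n_samples = 42588
bytes_per_float64 = 8

result_bytes = bytes_per_float64 * n_samples ** 2   # one float per pair of samples
result_gib = result_bytes / 1024 ** 3
print(f"{result_gib:.2f} GiB")
```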

Average values of precision, recall and fscore for each label

scikit-learn,classification,cross-validation,precision-recall

You can consider using the functions in the sklearn.metrics package. I think this method could do the work you expect. It gives you a 2D array with one row for each target unique value and columns for precision, recall, fscore and support. For fast logging you can use classification_report too....

Approach where features are combination of text(labels) and numerical

python,scikit-learn

It looks like you are looking for OneHotEncoder. For an explanation take a look at the Encoding categorical features section of the docs. The idea is that you will make a column for each city with 0/1 values if the sample belongs to the current city. You might also be...

Scikit learn SGDClassifier: precision and recall change the values each time

scikit-learn,classification,precision-recall

From the docs, SGDClassifier has a random_state parameter that defaults to None; it is the seed value used for the random number generator. You need to fix this value so that the results are repeatable, so set random_state=0 or whatever favourite number you want: clf = SGDClassifier(loss="log", penalty="elasticnet",n_iter=70, random_state=0) should...
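
A quick way to convince yourself, sketched on a synthetic dataset with otherwise default settings (not the parameters from the question):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Same seed and same data: the two fits end up with identical coefficients
coef_a = SGDClassifier(random_state=0).fit(X, y).coef_
coef_b = SGDClassifier(random_state=0).fit(X, y).coef_
same = np.array_equal(coef_a, coef_b)
print(same)
```

With identical coefficients, precision and recall on a fixed test set no longer change between runs.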

load_files in scikit-learn not loading all files in directory

python,machine-learning,dataset,scikit-learn,classification

Your data is stored in data.data and target in data.target. Try print(len(data.data)) instead. load_files() simply returns a sklearn.datasets.base.Bunch, which is a simple data wrapper. So, data is in this format: { 'DESCR': None, 'data': [], 'filenames': array(), 'target': array(), 'target_names': [] } This is why len(data) returns 5. Hope this...
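
The effect is easy to reproduce with a plain dictionary standing in for the Bunch (which is a dict subclass). The contents below are made up for illustration:

```python
# A plain dict standing in for the Bunch returned by load_files (hypothetical contents)
data = {
    "DESCR": None,
    "data": ["first document", "second document", "third document"],
    "filenames": ["a.txt", "b.txt", "c.txt"],
    "target": [0, 1, 0],
    "target_names": ["neg", "pos"],
}

n_keys = len(data)            # counts the keys of the container
n_docs = len(data["data"])    # counts the loaded documents
print(n_keys, n_docs)
```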

Add Features to An Sklearn Classifier

python,machine-learning,nlp,scikit-learn

You can use FeatureUnion: http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces There is a nice example in the documentation, http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#example-hetero-feature-union-py, which I think exactly fits your requirements. See the TextStats transformer.

Scikit-Learn: How to retrieve prediction probabilities for a KFold CV?

python,scikit-learn,classification

You need to do a GridSearchCrossValidation instead of just CV. CV is used for performance evaluation and itself doesn't fit the estimator actually. from sklearn.datasets import make_classification from sklearn.svm import SVC from sklearn.grid_search import GridSearchCV # unbalanced classification X, y = make_classification(n_samples=1000, weights=[0.1, 0.9]) # use grid search for tuning...

Using the predict_proba() function of RandomForestClassifier in the safe and right way

python,machine-learning,scikit-learn,random-forest

I get more than one digit in my results; are you sure it is not due to your dataset? (For example, using a very small dataset would lead to simple decision trees and so to 'simple' probabilities.) Otherwise it may only be the display that shows one digit,...

ImportError: No module named sklearn.cross_validation

python,scikit-learn

I have the same configuration, both Python and Ubuntu, and it works fine for me. Do you have Anaconda installed? Python 2.7.9 |Anaconda 2.2.0 (64-bit)| (default, Mar 9 2015, 16:20:48) [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2 Type "help", "copyright", "credits" or "license" for more information. Anaconda is brought to you by Continuum Analytics. Please...

How to do LabelEncoding or categorical value in Apache Spark

apache-spark,scikit-learn

We're developing sparkit-learn which aims to provide scikit-learn functionality and API on PySpark. You can use SparkLabelEncoder the following way: $ pip install sparkit-learn >>> from splearn.preprocessing import SparkLabelEncoder >>> from splearn import BlockRDD >>> >>> data = ["paris", "paris", "tokyo", "amsterdam"] >>> y = BlockRDD(sc.parallelize(data)) >>> >>> le =...

Finding a corresponding leaf node for each data point in a decision tree (scikit-learn)

python,machine-learning,scikit-learn,decision-tree

I finally got it to work. Here is one solution based on my correspondence on the scikit-learn mailing list: As of scikit-learn version 0.16.1, the apply method is implemented in clf.tree_, so I followed these steps: update scikit-learn to the latest version (0.16.1) so that you can use apply method...

Problems vectorizing specific columns with scikit learn DictVectorizer?

python,python-2.7,pandas,machine-learning,scikit-learn

You need to pass a list: test_matrix = dict_vect.transform(testing_data[['G1','G2','sex','school','age']]) What you did was try to index your df with the keys: ['G1','G2','sex','school','age'] which is why you get a KeyError as there is no such single column named like the above, to select multiple columns you need to pass a list...
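
The difference between the two indexing forms can be seen on a tiny made-up frame:

```python
import pandas as pd

# Hypothetical frame with a few of the column names from the question
df = pd.DataFrame({"G1": [10, 12], "G2": [11, 13], "sex": ["F", "M"]})

subset = df[["G1", "sex"]]    # a list of labels selects several columns

try:
    df["G1", "sex"]           # without the inner list, the tuple is treated as ONE label
    raised = False
except KeyError:
    raised = True
print(list(subset.columns), raised)
```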

sklearn Imputer() returned features does not fit in fit function

python,machine-learning,scikit-learn

With scikit-learn, initialising the model, training the model and getting the predictions are separate steps. In your case you have: train_fea = np.array([[1,1,0],[0,0,1],[1,np.nan,0]]) train_fea array([[ 1., 1., 0.], [ 0., 0., 1.], [ 1., nan, 0.]]) #initialise the model imp = Imputer(missing_values='NaN', strategy='mean', axis=0) #train the model imp.fit(train_fea) #get the...

Bag of words representation using sklearn plus Snowballstemmer

python,python-3.x,scikit-learn,nltk

You can pass a custom preprocessor which should work just as well, but retain the functionality of the tokenizer: from sklearn.feature_extraction.text import CountVectorizer from nltk.stem import SnowballStemmer list2 = ["rain", "raining", "rainy", "rainful", "rains", "raining!", "rain?"] def preprocessor(data): return " ".join([SnowballStemmer("english").stem(word) for word in data.split()]) vectorizer = CountVectorizer(preprocessor=preprocessor).fit(list2) print vectorizer.vocabulary_...

How do I use the ML sklearn pipeline to predict?

scikit-learn

You have to fit the pipeline first, like you fit any model/transformer: pipe.fit(traindf[features], traindf[labels]) ...

Port Python Code to Android

android,python,numpy,scikit-learn

From the sound of it, I'd suggest making a simple Django server; free hosting is available at pythonanywhere.com. If you don't know Django, just go through the basic tutorial. Follow it step by step and you'll be all set within just a few hours. Follow the tutorial and implement the example...

Normalization in Sklearn KNN

python-2.7,scikit-learn,classification,knn

It is not automatically done in sklearn. However sklearn provides tools to help you normalize your data, which you can use in sklearn's pipelines.
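
As one possible illustration, a scaler can be chained in front of the classifier. StandardScaler and the iris dataset are my choices here, not something the question prescribes:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is fitted during fit() and reapplied automatically at predict time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
predictions = model.predict(X[:5])
print(predictions)
```

Keeping the scaler inside the pipeline also means cross-validation rescales each fold using training data only.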

Why does not GridSearchCV give best score ? - Scikit Learn

python,r,machine-learning,scikit-learn,regression

The best_score_ is the best score from the cross-validation. That is, the model is fit on part of the training data, and the score is computed by predicting the rest of the training data. This is because you passed X_train and y_train to fit; the fit process thus does not...

I'm not sure how to interpret accuracy of this classification with Scikit Learn

python,machine-learning,scikit-learn,classification,text-classification

text_data = load_files("C:/Users/USERNAME/projects/machine_learning/my_project/train", ...) According to the documentation, that line loads your file's contents from C:/Users/USERNAME/projects/machine_learning/my_project/train into text_data.data. It will also load target labels (represented by their integer indexes) for each document into text_data.target. So text_data.data should be a list of strings and text_data.target a list of integers. The labels...

Difference between using train_test_split and cross_val_score in sklearn.cross_validation

python,scikit-learn,cross-validation

When using cross_val_score, you'll frequently want to use a KFold or StratifiedKFold iterator: http://scikit-learn.org/0.10/modules/cross_validation.html#computing-cross-validated-metrics http://scikit-learn.org/0.10/modules/generated/sklearn.cross_validation.KFold.html#sklearn.cross_validation.KFold By default, cross_val_score will not randomize your data, which can produce odd results like this if your data isn't random to begin with. The KFold iterator has a random state parameter:...
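
A sketch of passing a shuffled iterator explicitly. It uses the sklearn.model_selection import path of recent releases (the links above refer to the older sklearn.cross_validation module), and the estimator and dataset are placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)   # rows are sorted by class, so order matters

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # shuffle before splitting
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.mean())
```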

How to properly install sklearn on Eclipse

python-2.7,scikit-learn

Try uninstalling NumPy, SciPy and scikit-learn. Then reinstall those packages. If you are using Windows, binaries for those packages have been provided by Christoph Gohlke, which can be quite convenient. Just download them and then use pip to install. Unofficial Windows Binaries for Python Extension Packages...

Scikit : How to resolve this usecase

python,machine-learning,scikit-learn

A simple yet effective idea would be to train separate classifiers for text and numeric data. Make sure you normalize as you go. Now when you have, say, two different classifiers, you can combine their results to predict whether it is spam or not. Check http://scikit-learn.org/stable/modules/ensemble.html To further improve...

Why does scikit-learn cause core dumped?

python,scikit-learn,coredump

Going out on a limb here, but does your laptop by any chance have an AMD CPU? AMD have removed support for the 3DNow! instructions from their more recent processors (source), which a trawl of Ubuntu and Debian bugtrackers shows that many people are being hit by (eg 1, 2,...

In sklearn, does a fitted pipeline reapply every transform?

python,scikit-learn,pipeline,feature-selection

The pipeline calls transform on the preprocessing and feature selection steps if you call pl.predict. That means that the features selected in training will be selected from the test data (the only thing that makes sense here). It is unclear what you mean by "apply" here. Nothing new will be...

Understand SciKit Learn CV Validation Scores

python,machine-learning,scikit-learn

As I posted on the mailing list, the documentation makes this quite clear: grid_scores_ : list of named tuples Contains scores for all parameter combinations in param_grid. Each entry corresponds to one parameter setting. Each named tuple has the attributes: parameters, a dict of parameter settings mean_validation_score, the mean score...

Accessing transformer functions in `sklearn` pipelines

python,scikit-learn

The Pipeline documentation slightly overstates things. It has all the estimator methods of its last estimator. These include things like predict(), fit_predict(), fit_transform(), transform(), decision_function(), predict_proba().... It cannot use any other functions, because it wouldn't know what to do with all the other steps in the pipeline. For most situations,...

How to interpret scikit's learn confusion matrix and classification report?

machine-learning,nlp,scikit-learn,svm,confusion-matrix

The classification report should be straightforward: a report of P/R/F-measure for each class in your test data. In multiclass problems, it is not a good idea to read precision/recall and F-measure over the whole data, since any imbalance would make you feel you've reached better results. That's where such reports help....

Generating Difficult Classification Data Sets using scikit-learn

scikit-learn

Generally, the combination of a fairly low number of n_samples, a high probability of randomly flipping the label flip_y and a large number of n_classes should get you where you want. You can try the following: from sklearn.cross_validation import cross_val_score from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression lr =...

Calling scikit-learn functions from C++

python,c++,opencv,boost,scikit-learn

You can do it in a pretty straightforward way, by including python headers and just calling your python script and/or scikit methods via Py* wrappers. See https://docs.python.org/2/extending/embedding.html#pure-embedding for a thorough example....

Python LSA with Sklearn

python,scikit-learn,lsa

SVD is a dimensionality reduction tool, which means it reduces the order (number) of your features to a more representative set. From the source code on github: def fit_transform(self, X, y=None): """Fit LSI model to X and perform dimensionality reduction on X. Parameters ---------- X : {array-like, sparse matrix}, shape...

Plotting a ROC curve in scikit yields only 3 points

python,validation,machine-learning,scikit-learn,roc

The number of points depends on the number of unique values in the input. Since the input vector has only 2 unique values, the function gives correct output.
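
To see the effect, compare hard 0/1 predictions with continuous scores on a small made-up example:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
hard_labels = np.array([0, 1, 1, 1, 0, 0])            # only two unique values
scores = np.array([0.1, 0.6, 0.8, 0.9, 0.3, 0.4])     # six unique values (hypothetical)

fpr_hard, tpr_hard, _ = roc_curve(y_true, hard_labels)
fpr_soft, tpr_soft, _ = roc_curve(y_true, scores, drop_intermediate=False)
print(len(fpr_hard), len(fpr_soft))
```

Passing predict_proba or decision_function output instead of predict output is what gives a curve with more than three points.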

Principal component analysis using sklearn and panda

python,pandas,scikit-learn,pca,principal-components

You can try this. import pandas as pd import matplotlib.pyplot as plt from sklearn import decomposition # use your data file path here demo_df = pd.read_csv(file_path) demo_df.set_index('Unnamed: 0', inplace=True) target_names = demo_df.index.values tran_ne = demo_df.values pca = decomposition.PCA(n_components=4) pcomp = pca.fit_transform(tran_ne) pcomp1 = pcomp[:,0] fig, ax = plt.subplots() ax.scatter(x=pcomp1[0], y=0,...

How to list all scikit-learn classifiers that support predict_proba()

python,scikit-learn

from sklearn.utils.testing import all_estimators estimators = all_estimators() for name, class_ in estimators: if hasattr(class_, 'predict_proba'): print(name) You can also use CalibratedClassifierCV to make any classifier into one that has predict_proba. This was asked before on SO, but I can't find it, so you should be excused for the duplicate ;)...

Create classes in a loop

python,class,scikit-learn

In your original code, you created one instance and used it five times. Instead, you want to initialize the class only when you add it to the model_types array, as in this code. class xyz(object): def __init__(self): self.model_type = ensemble.RandomForestClassifier self.model_types = {} self.model = {} for x in range(0,5):...

How can i resample “roc_curve” (fpr,tpr )?

python,scikit-learn

I think what you are looking for is piecewise constant interpolation. import numpy as np from scipy.interpolate import spline fpr =[0,0.1,0.4,0.9,1] tpr =[0,0.3,0.4,0.5,1] n = 20 x_interp = np.linspace(0,1,n+1) y_interp = spline(fpr, tpr, x_interp, order=0) x_interp are the fpr values [ 0. 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4...
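scipy.interpolate.spline is not available in recent SciPy releases, so here is a NumPy-only sketch of one reading of piecewise constant interpolation: hold the tpr of the last known fpr at or below each new x. The helper name resample_roc_step is hypothetical, and it assumes fpr is sorted in ascending order, as roc_curve returns it.

```python
import numpy as np

def resample_roc_step(fpr, tpr, x_new):
    """Piecewise constant resampling: for each x, take the tpr of the
    last original point whose fpr is <= x."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    # Index of the last fpr value that is <= x.
    idx = np.searchsorted(fpr, x_new, side="right") - 1
    idx = np.clip(idx, 0, len(fpr) - 1)
    return tpr[idx]

fpr = [0, 0.1, 0.4, 0.9, 1]
tpr = [0, 0.3, 0.4, 0.5, 1]
n = 20
x_interp = np.linspace(0, 1, n + 1)
y_interp = resample_roc_step(fpr, tpr, x_interp)
```

If linear interpolation between the points is acceptable instead, np.interp(x_interp, fpr, tpr) is a one-line alternative.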

Scikit: Remove feature row if present in all documents

python,machine-learning,scikit-learn

Try this: goodwords = ((countmatrix > 1).mean(axis=0) <= 0.8).nonzero()[0] It first computes a Boolean matrix that is True wherever countmatrix > 1 and then takes its column-wise mean. If the mean is at most 0.8 (80%), the corresponding column index is returned by nonzero(). So, goodwords will contain all...
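A worked example on a small dense array may help. The count matrix below is made up, and the sketch assumes countmatrix is a plain NumPy ndarray (rows are documents, columns are terms).

```python
import numpy as np

# Hypothetical counts: 5 documents x 3 terms.
countmatrix = np.array([
    [2, 0, 3],
    [2, 1, 0],
    [5, 0, 0],
    [2, 2, 0],
    [3, 0, 0],
])

# Fraction of documents in which each term has a count above 1.
frequent_fraction = (countmatrix > 1).mean(axis=0)

# Keep the columns where that fraction is at most 80%.
goodwords = (frequent_fraction <= 0.8).nonzero()[0]

filtered = countmatrix[:, goodwords]
print(frequent_fraction, goodwords)
```

Term 0 exceeds the threshold in every document, so it is dropped; terms 1 and 2 remain.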

Error importing scikit-learn modules

python,scikit-learn

Problem was with scipy/numpy install. I'd been using the (normally excellent!) unofficial installers from http://www.lfd.uci.edu/~gohlke/pythonlibs/. Uninstall/re-install from there made no difference, but installing with the official installers (linked from http://www.scipy.org/install.html) did the trick.

How to get feature names corresponding to scores for chi square feature selection in scikit

python,scikit-learn,chi-squared

It's right there in the documentation: get_feature_names()...

Can I use scikit-learn with Django framework?

django,data,scikit-learn,analysis

Django is a Python framework, meaning that you need to have Python installed to use it. Once you have Python, you can use whatever Python package you want (compatible with the version of Python you are using).

How to calculate tf-idf for a list of dict?

python,scipy,scikit-learn

First convert your dictionary into a list of strings: X_all = list(d.values()) Build the TfidfVectorizer as: from sklearn.feature_extraction.text import TfidfVectorizer tfv = TfidfVectorizer(min_df=3, max_features=None, strip_accents='unicode', analyzer='word',token_pattern=r'\w{1,}', ngram_range=(1,2), use_idf=1,smooth_idf=1,sublinear_tf=1, stop_words = 'english') and then you can build your model as: X_all = tfv.fit_transform(X_all) where X_all...
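A minimal end-to-end sketch of those steps, assuming d maps hypothetical document ids to text. min_df is lowered to 1 here only because the toy corpus is too small for min_df=3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: document id -> text.
d = {
    "doc1": "machine learning with scikit-learn",
    "doc2": "learning tf-idf weights for text",
    "doc3": "text classification with machine learning",
}

texts = list(d.values())

tfv = TfidfVectorizer(min_df=1, ngram_range=(1, 2), stop_words="english")

# fit_transform learns the vocabulary and idf weights, then vectorizes.
X = tfv.fit_transform(texts)

print(X.shape)
```

The result is a sparse matrix with one row per document and one column per unigram or bigram in the learned vocabulary.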

Why is TfidfVectorizer in scikit-learn showing this behavior?

python-2.7,scikit-learn,tf-idf

It is advisable to write regular expressions as raw string literals (prefixed with r); this should work: vectorizer2 = TfidfVectorizer(token_pattern=r'(?u)\b\w\w+\b', ngram_range=(1, 2), max_df=1.0, min_df=1) train_set_tfidf = vectorizer2.fit_transform(train_set) This is a known bug in the documentation, but if you look at the source code they do use raw literals....
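The reason the r prefix matters can be seen with the standard library alone. In a normal string literal "\b" is a backspace character, while in a raw literal it stays as a backslash followed by b, which the regex engine reads as a word boundary. The sample sentence is made up.

```python
import re

plain = "\b"   # one character: backspace (0x08)
raw = r"\b"    # two characters: backslash and 'b'

# The token pattern from the answer, written as a raw literal.
pattern = r"(?u)\b\w\w+\b"

# Matches words of two or more characters.
tokens = re.findall(pattern, "It is a known bug")
print(len(plain), len(raw), tokens)
```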

Undefined symbols in Scipy and Scikit-learn on RedHat

python,scipy,scikit-learn,atlas

Thanks to Andreas Mueller for the tip: doing a local install of anaconda to a directory I owned got around the compilation issues.

Python thread locking/class variable initialisation confusion

python,multithreading,class,locking,scikit-learn

Threading by itself does not speed up Python processes whose bottleneck is the CPU rather than I/O (read/write), because of the global interpreter lock (GIL). To actually get a speedup, sklearn uses multiprocessing for parallelization. This is different from threading in that the objects are copied into a separate process, and...

scikit : Wrong prediction for this case

python,machine-learning,scikit-learn

This is expected (or at least not so unexpected) behavior with the code you have written: you have two instances labeled as dog in which you have the term "this is", so the algorithm learns that "this is" is related to dog. It might not be what you're after, but...

load data from csv into Scikit learn SVM

python,csv,numpy,scikit-learn

You passed the param delimiter=',' but your csv was not comma separated. So the following works: In [378]: data = np.loadtxt(path_to_data) data Out[378]: array([[ 1.20000000e+01, 1.22000000e+02, 3.40000000e+01], [ 1.22340000e+04, 5.40000000e+01, 2.30000000e+01], [ 2.30000000e+01, 3.40000000e+01, 2.30000000e+01]]) The docs show that by default the delimiter is None and so treats whitespace as...
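A small reproduction using an in-memory file in place of the CSV on disk. The text below mirrors the whitespace-separated numbers shown in the output above.

```python
import io

import numpy as np

# Whitespace-separated values, standing in for the file on disk.
text = "12 122 34\n12234 54 23\n23 34 23\n"

# Default delimiter (None): any run of whitespace separates fields.
data = np.loadtxt(io.StringIO(text))

# Forcing a comma delimiter makes each whole line a single field,
# which cannot be parsed as one float.
try:
    np.loadtxt(io.StringIO(text), delimiter=",")
    comma_failed = False
except ValueError:
    comma_failed = True

print(data.shape, comma_failed)
```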

Lasso Generalized linear model in Python

python,statistics,scikit-learn,statsmodels,cvxopt

statsmodels has had for some time a fit_regularized for the discrete models including NegativeBinomial. http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.NegativeBinomial.fit_regularized.html which doesn't have the docstring (I just saw). The docstring for Poisson has the same information http://statsmodels.sourceforge.net/devel/generated/statsmodels.discrete.discrete_model.Poisson.fit_regularized.html and there should be some examples available in the documentation or unit tests. It uses an interior...

python: How to get real feature name from feature_importances

python,scikit-learn,classification,feature-selection

The feature_importances_ attribute holds the relative importance numbers in the order the features were fed to the algorithm. So in order to get the top 20 features you'll want to sort the features from most to least important, for instance like this: importances = forest.feature_importances_ indices = numpy.argsort(importances)[-20:] ([-20:] because...
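A runnable sketch of mapping importances back to names, assuming synthetic data and a hypothetical feature_names list kept in the same column order as the training matrix. It takes the top 3 rather than the top 20 because the toy data set has only 8 features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)
# Hypothetical names, in the same order as the columns of X.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

top_k = 3
# argsort is ascending, so reverse it to list the most important first.
indices = np.argsort(importances)[::-1][:top_k]
top_features = [(feature_names[i], importances[i]) for i in indices]

for name, score in top_features:
    print(name, round(score, 3))
```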

NearestNeighbors tradeoff - run faster with less accurate results

python,scikit-learn,smooth,nearest-neighbor

First of all, why use a ball tree? Maybe your metric calls for it, but if that's not the case, you could use a kd-tree too. I will approach your question from a theoretical point of view. The radius parameter is set to 1.0 by default. This might...

How to specify the prior probability for scikit-learn's Naive Bayes

python,syntax,machine-learning,scikit-learn

The GaussianNB() implemented in scikit-learn at the time of this answer did not allow you to set the class prior. If you read the online documentation, you see that class_prior_ is an attribute rather than a parameter. Once you fit the GaussianNB(), you can access the class_prior_ attribute. It is calculated by simply counting the number of different...
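A short sketch of inspecting class_prior_ after fitting, on made-up data. The second half assumes a newer scikit-learn release in which GaussianNB accepts a priors argument; that goes beyond what the answer describes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: three samples of class 0, one of class 1.
X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0, 0, 0, 1])

# Priors estimated from the class frequencies in y.
clf = GaussianNB().fit(X, y)
learned_prior = clf.class_prior_

# Assumption: newer releases let you fix the priors explicitly.
clf_fixed = GaussianNB(priors=[0.5, 0.5]).fit(X, y)

print(learned_prior, clf_fixed.class_prior_)
```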

Identifying a sklearn-model's classes

python,scikit-learn,svm

The grid_search.estimator that you are looking at is the unfitted pipeline. The classes_ attribute only exists after fitting, as the classifier needs to have seen y. What you want is the estimator that was trained using the best parameter settings, which is grid_search.best_estimator_. The following will work: clf = grid_search.best_estimator_.named_steps['svm']...
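A self-contained sketch of the difference, assuming the iris data set and a two-step pipeline whose final step is named 'svm'. The step names and parameter grid are chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
grid_search = GridSearchCV(pipe, {"svm__C": [0.1, 1.0, 10.0]}, cv=3)
grid_search.fit(X, y)

# The original pipeline is cloned during the search and stays unfitted.
unfitted_has_classes = hasattr(
    grid_search.estimator.named_steps["svm"], "classes_"
)

# The refitted best pipeline holds the trained SVC.
clf = grid_search.best_estimator_.named_steps["svm"]
print(unfitted_has_classes, clf.classes_)
```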

series object not callable with linear regression in python

python,scikit-learn,linear-regression,statsmodels

Try lm.params instead of lm.params(). The latter tries to call params as a function (which it isn't)...
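The error can be reproduced without fitting a model. The sketch below uses a plain pandas Series as a hypothetical stand-in for lm.params, with made-up coefficient values.

```python
import pandas as pd

# Hypothetical stand-in for lm.params: coefficients indexed by name.
params = pd.Series({"Intercept": 1.5, "x": 0.8})

# Attribute-style use: index into the Series directly.
slope = params["x"]

# Calling the Series like a function raises TypeError.
try:
    params()
    call_error = None
except TypeError as exc:
    call_error = str(exc)

print(slope, call_error)
```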