The title says it all. I have checked the scikit-learn docs, which are very poor for this particular task, and I have checked several online resources, including this post. However, they seem to be wrong. For feature selection we can do something like:
X_reduced = clf.fit_transform(X_full,y_full)
Now, if we inspect the shape of X_reduced, it is very clear how many features were selected. So the question now is: which ones?
That post suggests iterating over the coef_ attribute of the LinearSVC: the features whose coef_ entries are different from zero are the ones selected. Well, this is wrong, but it gets very close to the real result.
Inspecting X_reduced, I noticed that 310 features were selected, and that is for sure: I am checking the resulting matrix. But if I do the coef_ thing, 414 features come out as selected, from a total of 2000, so it is close to the real number but not equal.
According to the scikit-learn LinearSVC docs there is a mean(X) involved for threshold=None, but I am stuck and have no idea what to do now.
UPDATE: Here is a link with the data and code that reproduce the error; it is just a few KB.
Best How To:
LinearSVC() does return features with non-zero coefficients. Could you please upload the sample data file and code script (for example, via a Dropbox share link) that reproduces the inconsistency you saw?
from sklearn.datasets import load_svmlight_file
from sklearn.svm import LinearSVC
import numpy as np

# load the uploaded data
X, y = load_svmlight_file("/home/Jian/Downloads/errorScikit/weirdData")

transformer = LinearSVC(penalty='l1', dual=False, random_state=0)
transformer.fit(X, y)  # fit before transforming

# select with threshold = machine epsilon instead of the default 1e-5
X_reduced = transformer.transform(X, threshold=np.finfo(np.float).eps)
print(str(X_reduced.shape) + " is NOW equal to " + str((transformer.coef_ != 0).sum()))
414 is NOW equal to 414
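On recent scikit-learn versions LinearSVC no longer has a transform method, but the counting discrepancy above can be sketched directly from coef_. This is only a sketch on synthetic make_classification data (the uploaded weirdData file is not reproduced in this post), so the exact counts will differ:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic stand-in for the question's data; the real file is not available here
X, y = make_classification(n_samples=62, n_features=2000, n_informative=20,
                           random_state=0)

clf = LinearSVC(penalty='l1', dual=False, random_state=0, max_iter=5000)
clf.fit(X, y)

n_nonzero = (clf.coef_ != 0).sum()               # every non-zero coefficient
n_above_eps = (np.abs(clf.coef_) > 1e-5).sum()   # drops tiny numerical leftovers

# any gap between the two counts is coefficients that are non-zero
# but numerically negligible
print(n_nonzero, n_above_eps)
```

The second count can only be smaller than or equal to the first, which is exactly the 414-versus-310 pattern in the question.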
# as suggested by user3914041, if you want both sides to be 310,
# count the coefficients against the default 1e-5 threshold instead:
(abs(transformer.coef_) > 1e-5).sum()
# this matches the shape you get from the default transform:
Out: (62, 310)
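For completeness: on current scikit-learn the selection step lives in SelectFromModel, and get_support answers the original "which ones?" question directly. A sketch on synthetic data (the estimator and the 1e-5 threshold mirror the snippet above; the data is made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# synthetic stand-in data
X, y = make_classification(n_samples=100, n_features=50, n_informative=10,
                           random_state=0)

svc = LinearSVC(penalty='l1', dual=False, random_state=0, max_iter=5000)
selector = SelectFromModel(svc, threshold=1e-5)
X_reduced = selector.fit_transform(X, y)

# get_support gives the boolean mask (or the integer indices)
# of the features that were kept
mask = selector.get_support()
indices = selector.get_support(indices=True)
print(X_reduced.shape, indices)
```

The number of True entries in the mask always matches the column count of X_reduced, so the shape and the coefficient count can no longer disagree.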