I have applied random forest to training data with about 100 features. Now I would like to apply a feature selection technique to reduce the number of features before fitting the random forest model. How can I use the varImp function (from the caret package) to select important features? I read that varImp itself uses some classification method to select features (which I found very counterintuitive). How exactly can I apply varImp to get an important subset of features that I can then use when applying the random forest classification algorithm?
Best How To:
From the caret package author Max Kuhn, on feature selection:
Many models that can be accessed using caret's train function produce prediction equations that do not necessarily use all the predictors. These models are thought to have built-in feature selection
And rf is one of them.
Many of the functions have an ancillary method called predictors that returns a vector indicating which predictors were used in the final model.
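As a minimal sketch of that workflow (using the built-in iris data as a stand-in for your own 100-feature data):

```r
library(caret)

# Fit a random forest via caret's train(); "rf" wraps randomForest()
set.seed(42)
fit <- train(Species ~ ., data = iris, method = "rf")

# predictors() returns the predictors actually used in the final model
predictors(fit)
```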
If you want to retrieve importance scores from your model, add importance = TRUE to your train() call.
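A rough sketch of that call, again on the iris data (the cutoff of 20 predictors below is an arbitrary choice for illustration, not a recommendation):

```r
library(caret)

# importance = TRUE is passed through to randomForest(), which then
# computes permutation-based importance alongside the fit
set.seed(42)
fit <- train(Species ~ ., data = iris, method = "rf", importance = TRUE)

# varImp() extracts the importance scores (scaled 0-100 by default)
imp <- varImp(fit)
print(imp)

# For classification, imp$importance has one column per class; one way
# to rank predictors overall is by their mean importance across classes
scores  <- imp$importance
ranked  <- rownames(scores)[order(rowMeans(scores), decreasing = TRUE)]

# e.g. keep the top 20 predictors (on your 100-feature data)
top_features <- head(ranked, 20)
```

Note, though, that per the quote above this subsetting step is often unnecessary for rf, since the model already down-weights uninformative predictors internally.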
In many cases, using these models with built-in feature selection will be more efficient than algorithms where the search routine for the right predictors is external to the model. Built-in feature selection typically couples the predictor search algorithm with the parameter estimation, and both are usually optimized with a single objective function (e.g. error rates or likelihood).