apache-spark,random-forest,mllib,pyspark

As far as I can tell this is not supported in the current version (1.2.1). The Python wrapper over the native Scala code (tree.py) defines only 'predict' functions which, in turn, call the respective Scala counterparts (treeEnsembleModels.scala). The latter make decisions by taking a vote among binary decisions. A much...

python,machine-learning,scikit-learn,random-forest

There is a default of ten trees, and you are using that default in your code. From the documentation: Parameters: n_estimators : integer, optional (default=10) The number of trees in the forest. Try increasing the number of trees to 25 or some greater number: RandomForestClassifier(n_estimators=25, n_jobs=2) If...
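A minimal sketch of setting `n_estimators` explicitly (using a toy dataset from `make_classification` rather than the asker's data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data purely for illustration; any classification dataset works.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Ask for 25 trees explicitly instead of relying on the default of 10.
clf = RandomForestClassifier(n_estimators=25, n_jobs=2, random_state=0)
clf.fit(X, y)

print(len(clf.estimators_))  # the fitted forest now holds 25 trees
```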

python,machine-learning,scikit-learn,random-forest

The training examples are stored by row in "csv-data.txt" with the first number of each row containing the class label. Therefore you should have: X_train = my_training_data[:,1:] Y_train = my_training_data[:,0] Note that in the second index in X_train, you can leave off the end index, and the indices will automatically...
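That slicing can be sketched as follows, with a small hypothetical stand-in array for the rows loaded from "csv-data.txt" (first column is the class label):

```python
import numpy as np

# Hypothetical stand-in for the loaded training data:
# first number of each row is the class label, the rest are features.
my_training_data = np.array([
    [0, 1.0, 2.0, 3.0],
    [1, 4.0, 5.0, 6.0],
    [0, 7.0, 8.0, 9.0],
])

X_train = my_training_data[:, 1:]   # every column except the first
Y_train = my_training_data[:, 0]    # just the first column (the labels)

print(X_train.shape, Y_train.shape)
```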

r,validation,weka,random-forest,cross-validation

I don't know about Weka, but I have done randomForest modelling in R, and I have always used the predict function there to do this. Try using this function: predict(Model, data). Bind the output with the original values and use the table command to get the confusion matrix....

python,scikit-learn,random-forest

I agree the first quote is self-contradictory. Maybe the following would be better: The best results are also often reached with fully developed trees (max_depth=None and min_samples_split=1). Bear in mind though that these values are usually not guaranteed to be optimal. The best parameter values should always be cross-validated. For...

python,scikit-learn,random-forest

OneVsRestClassifier has an attribute estimators_ : list of n_classes estimators. So to get the feature importances of the i-th RandomForest: print clf.estimators_[i].feature_importances_ ...
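A short sketch of this (fitting on iris purely for illustration; the asker's classifier and data will differ):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One RandomForest is fitted per class (3 classes in iris).
clf = OneVsRestClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X, y)

# Index into estimators_ to reach each per-class forest's importances.
for i, est in enumerate(clf.estimators_):
    print(i, est.feature_importances_)
```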

Hopefully this helps. The names are all generic, but if you provide some example code and data I can clarify things. prediction <- predict(your_rf, testdata, type = "response") location <- prediction == testdata$target testdata[location,] ...

python,scikit-learn,random-forest,cross-validation

You have to fit your data before you can get the best parameter combination. from sklearn.grid_search import GridSearchCV from sklearn.datasets import make_classification from sklearn.ensemble import RandomForestClassifier # Build a classification task using 3 informative features X, y = make_classification(n_samples=1000, n_features=10, n_informative=3, n_redundant=0, n_repeated=0, n_classes=2, random_state=0, shuffle=False) rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt'...
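A condensed sketch of the fit-then-inspect order, on toy data (the parameter grid here is illustrative, not the asker's); best_params_ only exists after fit has been called:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old releases

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)

param_grid = {"n_estimators": [10, 25], "max_depth": [3, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)

grid.fit(X, y)            # fitting populates best_params_, best_score_, etc.
print(grid.best_params_)
```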

r,variables,machine-learning,neural-network,random-forest

See varImp() in the caret library, which supports all the ML algorithms you referenced.

Without knowing what Y is in your specific case, I think it is the source of your error. The documentation says that Y is an array of true class labels... True class labels can be a numeric vector, character matrix, vector cell array of strings or categorical vector. A categorical...

machine-learning,scikit-learn,random-forest

Scikit-learn does use only the bootstrap sample for each tree to calculate the probability estimate of each class, right? No, it uses only the in-sample part, and therefore will not give very calibrated probability outputs (which I guess is what the paper suggests). You could get better probability estimates...
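One option for better-calibrated probabilities, sketched below on toy data, is to wrap the forest in scikit-learn's CalibratedClassifierCV, which refits the probability mapping on held-out folds:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Calibrate the forest's predict_proba output via 3-fold cross-validation.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=25, random_state=0), cv=3)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(proba.shape)
```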

python,python-2.7,scikit-learn,classification,random-forest

I believe this is possible by modifying the estimators_ and n_estimators attributes on the RandomForestClassifier object. Each tree in the forest is stored as a DecisionTreeClassifier object, and the list of these trees is stored in the estimators_ attribute. To make sure there is no discontinuity, it also makes sense...
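A minimal sketch of the idea, on toy data: trim the estimators_ list and keep n_estimators in sync so the ensemble's bookkeeping stays consistent.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)

# Drop the last tree, then update n_estimators to match the list length.
clf.estimators_ = clf.estimators_[:-1]
clf.n_estimators = len(clf.estimators_)

print(clf.n_estimators)     # 9
preds = clf.predict(X[:5])  # the trimmed forest still predicts
```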

python,matlab,machine-learning,statistics,random-forest

The Problem There are many reasons why the implementation of a random forest in two different programming languages (e.g., MATLAB and Python) will yield different results. First of all, note that results of two random forests trained on the same data will never be identical by design: random forests often...

I can answer only the first part of your question. how can I attach the prediction results to the data frame To do this you can use the cbind function: Considering the results of your predictions: predict(rf) Turn them into a data frame predResults <- data.frame(predict(rf)) And update your original...

You can exclude variables by changing what's delivered to the function in the data argument. exclude_cols <- c('dummy_48','dummy_50','other_var_to_be_dropped') fit <- randomForest(result ~ ., data=df[ !names(df) %in% exclude_cols ] , importance=TRUE, ntree=2000) The subset argument to this function only works on a row basis....

scikit-learn,classification,random-forest,ensemble-learning

You can access the individual decision trees in the estimators_ attribute of a fitted random forest instance. You can even re-sample that attribute (it's just a Python list of decision tree objects) to add or remove trees and see the impact on the quality of the prediction of the resulting...

I've confirmed with a test case that the prediction for the randomForest regressor does not depend on the decision node values in the prediction field: > x <- data.frame(matrix(rnorm(20), nrow=10)) > y <- rnorm(10) > > model <- randomForest(x,y,ntree=1) > getTree(model,k=1) left daughter right daughter split var split point status...

r,machine-learning,glm,prediction,random-forest

You need to specify type='response' for this to happen: Check this example: y <- rep(c(0,1),c(100,100)) x <- runif(200) df <- data.frame(y,x) fitgbm <- gbm(y ~ x, data=df, distribution = "bernoulli", n.trees = 100) predgbm <- predict(fitgbm, df, n.trees=100, type='response') Too simplistic but look at the summary of predgbm: > summary(predgbm)...

r,formula,random-forest,caret,predict

First, almost never use the $finalModel object for prediction. Use predict.train. This is one good example of why. There is some inconsistency between how some functions (including randomForest and train) handle dummy variables. Most functions in R that use the formula method will convert factor predictors to dummy variables because...

r,random-forest,feature-selection

From the caret package author Max Kuhn on feature selection: Many models that can be accessed using caret's train function produce prediction equations that do not necessarily use all the predictors. These models are thought to have built-in feature selection. And rf is one of them. Many of the functions...

c++,opencv,machine-learning,regression,random-forest

To use it for regression, you just have to set var_type to CV_VAR_ORDERED, i.e. var_type.at<uchar>(ATTRIBUTES_PER_SAMPLE, 0) = CV_VAR_ORDERED; and you might want to set the regression_accuracy to a very small number like 0.0001f. ...

python-2.7,intersection,random-forest

Normalization Start by normalizing the data to the range 0 to 1, using the following formula: Norm(e) = (e - Emin) / (Emax - Emin) for each value e in each vector. (I don't know how to put math symbols in here or I would.)...
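The formula can be sketched as a small helper (the function name and sample values are illustrative):

```python
def min_max_normalize(values):
    """Scale each value e to (e - Emin) / (Emax - Emin), i.e. into [0, 1]."""
    e_min, e_max = min(values), max(values)
    return [(e - e_min) / (e_max - e_min) for e in values]

print(min_max_normalize([2.0, 4.0, 6.0, 10.0]))  # [0.0, 0.25, 0.5, 1.0]
```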

python,scikit-learn,random-forest

You have to do some encoding before using fit. As noted, fit() does not accept strings, but you can solve this. There are several classes that can be used: LabelEncoder: turns your strings into incremental integer values. OneHotEncoder: uses the one-of-K scheme to transform your strings into integer...
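A minimal LabelEncoder sketch (the color strings are just an illustration):

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "green", "blue", "green"]

le = LabelEncoder()
encoded = le.fit_transform(colors)  # each string becomes an integer code

print(encoded)      # codes follow the sorted class order
print(le.classes_)  # ['blue' 'green' 'red']
```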

python,pandas,scikit-learn,random-forest,feature-selection

Here's what I ginned up. It's a pretty simple solution, and relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. But, it should be easily made more extensible if desired. from sklearn import datasets import pandas from sklearn.ensemble import RandomForestClassifier from sklearn import cross_validation...

r,machine-learning,random-forest

Looks like you want to draw your data only from the training data.frame, so you should not be referencing Aggregate in your formula. Using the variable names actually in your test data, this seems to work just fine. randomForest(as.factor(Is.Customer) ~ oldTitleCount + youngTitleCount + totalActivities + Max.between.2.activities + Time.since100thactivity +...

You are missing the randomForest library. It is one of the suggested libraries in caret and where the rf method comes from. After you install it, it should work fine like this: library(randomForest) library(caret) data(iris) inTrain <- createDataPartition(y=iris$Species, p=0.7, list=FALSE) training <- iris[inTrain,] testing <- iris[-inTrain,] # Use the Random...

A sample example for AUC: rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes) library(ROCR) predictions=as.vector(rf_output$votes[,2]) pred=prediction(predictions,target) perf_AUC=performance(pred,"auc") #Calculate the AUC value AUC=perf_AUC@y.values[[1]] perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve plot(perf_ROC, main="ROC plot") text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE))) or using pROC and caret library(caret)...

machine-learning,scikit-learn,k-means,random-forest

Be careful in turning a list of string values into categorical ints as the model will likely interpret the integers as being numerically significant, but they probably are not. For instance, if: 'Dog'=1,'Cat'=2,'Horse'=3,'Mouse'=4,'Human'=5 Then the distance metric in your clustering algorithm would think that humans are more like mice than...
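One common way around this, sketched with pandas (the animal column is hypothetical), is one-hot encoding, which gives every category its own 0/1 column so that no two categories are numerically "closer" than any other pair:

```python
import pandas as pd

animals = pd.DataFrame({"animal": ["Dog", "Cat", "Horse", "Mouse", "Human"]})

# Each category becomes its own indicator column; exactly one is set per row.
dummies = pd.get_dummies(animals["animal"])
print(dummies)
```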

Moving answer from comments. Try reducing the number of trees by setting a smaller n_estimators parameter. You can then try to control the tree depth using max_depth or min_samples_split and trade off depth for an increased number of estimators....
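A minimal sketch of those knobs on toy data (the particular values are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# Fewer, shallower trees: cheaper to train and far smaller to store.
clf = RandomForestClassifier(n_estimators=20, max_depth=4,
                             min_samples_split=5, random_state=0)
clf.fit(X, y)

depths = [tree.get_depth() for tree in clf.estimators_]
print(max(depths))  # no tree grows past max_depth=4
```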

Use the formula interface to build the product into the model specification: class_var ~ var1 + var2 + I(var1 * var2) The I function means that the numerical product will be calculated, rather than an interaction term, which is not the numerical product if either of...

r,classification,bayesian,random-forest

I'm actually working on a similar issue and have written an R package that runs randomForest as the local classifier along a pre-defined class hierarchy. You can find it on R-Forge under 'hie-ran-forest'. The package includes two ways to turn the local probabilities into a crisp class: stepwise majority rule-...

RandomForest uses the bootstrap to create many training sets by sampling the data with replacement (bagging). Each bootstrapped set is very close to the original data, but slightly different, since it may contain multiples of some points while other points from the original data are missing. (This helps...

r,random-forest,feature-selection

The caret package has a very useful function called varImp (http://www.inside-r.org/packages/cran/caret/docs/varImp). If you don't have a very large number of predictors, you could use it to get/plot their importance. In your case, let's suppose you have trained your model: training svm = classify(data = data.trainS4, method = "svm", normalize =...

Your RF model contains five decision trees. Class probabilities are calculated by dividing the number of decision trees that voted for a particular class by the total number of decision trees. In your example, one decision tree voted for class "versicolor" (1 / 5 = 0.2), and the remaining four...
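The same arithmetic can be observed in scikit-learn (this sketch uses iris and fully grown trees, where each leaf is pure, so the averaged per-tree probabilities reduce to vote fractions over the 5 trees):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# With 5 fully grown trees, each reported probability is k/5 for some k.
clf = RandomForestClassifier(n_estimators=5, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:1])
print(proba)  # three class probabilities summing to 1
```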

The following should work: model1 <- train(as.factor(Cover_Type) ~ Elevation + Aspect + Slope + Horizontal_Distance_To_Hydrology, data = data.train, method = "rf", tuneGrid = data.frame(mtry = 3)) It's always better to specify the tuneGrid parameter, which is a data frame of possible tuning values. Look at ?randomForest and ?train for more...

scikit-learn,random-forest,cross-validation

Looks like a bug, but in your case it should work if you use RandomForestRegressor's own scorer (which coincidentally is R^2 score) by not specifying any scoring function in GridSearchCV: clf = GridSearchCV(ensemble.RandomForestRegressor(), tuned_parameters, cv=5, n_jobs=-1, verbose=1) EDIT: As mentioned by @jnothman in #4081 this is the real problem: scoring...

You just swapped new_values and test_new inside rbind. I changed it, tried the code below, and could obtain a data frame with all the tree data, numbered according to the tree: # Do some setup, and train a basic random forest model library(randomForest) data(iris) model <- randomForest(Species ~ ., data=iris)...

machine-learning,weka,random-forest,decision-tree

When a tree is split on a numerical attribute, it is split on a condition like a > 5. This condition effectively becomes a binary variable, and the criterion (information gain) is exactly the same. P.S. For regression, the commonly used criterion is the sum of squared errors (computed for each leaf, then summed over the leaves). But...

Question 1 - why does ntree show 1?: summary(rf) shows you the length of the objects that are included in your rf variable. That means that rf$ntree is of length 1. If you type rf$ntree in your console, you will see that it shows 800. Question 2 - does a...

The error message is originating from R; the list expected there is an R list. Try using: RFl = robjects.vectors.ListVector([('X%i' % i, x) for i, x in enumerate(RF)]) edit: the constructor for ListVector wants names for the list elements. 2nd edit: However, the real path to a solution is...

You need to convert your char columns into factors. Factors are treated as integers internally whereas character fields are not. See the following small demonstration: Data: df <- data.frame(y = sample(0:1, 26, rep=T), x1=runif(26), x2=letters, stringsAsFactors=F) df$y <- as.factor(df$y) > str(df) 'data.frame': 26 obs. of 3 variables: $ y :...

python,r,pandas,random-forest,rpy2

There is a pull-request that adds R factor to Pandas Categorical functionality to Pandas. It has not yet been merged into the Pandas master branch. When it is, import pandas.rpy.common as rcom rcom.convert_robj(pr) will convert pr to a Pandas Categorical. Until then, you can use as a workaround: def convert_factor(obj):...

r,random-forest,confusion-matrix

Use randomForest(..., do.trace=T) to see the OOB error during training, by both class and ntree. (FYI, you chose ntree=1, so you'll get just one tree, not a forest; this kind of defeats the purpose of using RF and randomly choosing a subset of both features and samples. You...

apache-spark,random-forest,mllib

The short version is no. If we look at RandomForestClassifier.scala, you can see that it always simply selects the max. You could override the predict function if you want, but it's not super clean. I've added a JIRA to track adding this.

Response from package maintainer: That parameter behaves as the way that Leo Breiman intended. The bug is in how the parameter was described. It’s the same as minsplit in the rpart:::rpart.control() function: the minimum number of observations that must exist in a node in order for a split to be...

python,r,algorithm,random-forest

I have finally been able to solve the problem. The code is still slow due to the continuous copy operation performed in each split on the data. This is the working version of the algorithm. import numpy as np import sklearn as sk import matplotlib.pyplot as plt import pandas as...

r,parallel-processing,random-forest,r-caret

The reason that you don't see a difference is that you aren't using the caret package. You do load it into your environment with the library() command, but then you run randomForest() which doesn't use caret. I'll suggest starting by creating a data frame (or data.table) that contains only your...

python,machine-learning,scikit-learn,random-forest

I get more than one digit in my results; are you sure it is not due to your dataset? (For example, a very small dataset would yield simple decision trees and so 'simple' probabilities.) Otherwise it may only be the display that shows one digit,...