Looking for some help from the Orange experts out there.
I have a data set of about 6 million rows. For simplicity's sake, we'll consider only two columns. One holds positive decimal numbers and is imported as a continuous value. The other holds discrete values (either 0 or 1), with a 30:1 ratio of 1's to 0's.
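To put the 30:1 ratio in perspective, here is a minimal sketch of the accuracy a classifier would get by always predicting the majority class. The function name `majority_baseline_accuracy` and the counts are purely illustrative and are not part of Orange or of my actual script.

```python
# Hypothetical illustration: class counts for a 30:1 ratio of 1's to 0's.
ones, zeros = 30, 1


def majority_baseline_accuracy(n_majority, n_minority):
    """Accuracy of a classifier that always predicts the majority class."""
    return float(n_majority) / (n_majority + n_minority)


baseline = majority_baseline_accuracy(ones, zeros)
print("Majority-class baseline accuracy: %.3f" % baseline)
```

Any accuracy figure close to this baseline says little about how well the minority class (the 0's) is being handled.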
I am using a classification tree (which I label as 'learner') to get the classifier. I'm then trying to do a cross-validation on my data set while adjusting for the overwhelming 30:1 sample bias. I've tried several variations to do this but continue to get the same result regardless of whether I stratify the data or not.
Below is my code and I've commented out the various lines I've tried (using both True and False values for stratification):
    import Orange
    import os
    import time
    import operator

    start = time.time()
    print "Starting"
    print ""

    mydata = Orange.data.Table("testData.csv")

    # This is used only for the test_with_indices method below
    indicesCV = Orange.data.sample.SubsetIndicesCV(mydata)

    # I only want the highest level classifier so max_depth=1
    learner = Orange.classification.tree.TreeLearner(max_depth=1)

    # These are the lines I've tried:
    #res = Orange.evaluation.testing.cross_validation([learner], mydata, folds=5, stratified=True)
    #res = Orange.evaluation.testing.proportion_test([learner], mydata, 0.8, 100, store_classifiers=1)
    res = Orange.evaluation.testing.proportion_test([learner], mydata, learning_proportion=0.8, times=10, stratification=True, store_classifiers=1)
    #res = Orange.evaluation.testing.test_with_indices([learner], mydata, indicesCV)

    f = open('results.txt', 'a')
    divString = "\n##### RESULTS (" + time.strftime("%Y-%m-%d %H:%M:%S") + ") #####"
    f.write(divString)
    f.write("\nAccuracy: %.2f" % Orange.evaluation.scoring.CA(res))
    f.write("\nPrecision: %.2f" % Orange.evaluation.scoring.Precision(res))
    f.write("\nRecall: %.2f" % Orange.evaluation.scoring.Recall(res))
    f.write("\nF1: %.2f\n" % Orange.evaluation.scoring.F1(res))

    tree = learner(mydata)
    f.write(tree.to_string(leaf_str="%V (%M out of %N)"))
    print tree.to_string(leaf_str="%V (%M out of %N)")

    end = time.time()
    print "Ending"
    timeStr = "Execution time: " + str((end - start) / 60) + " minutes"
    f.write(timeStr)
    f.close()
Note: It may look as though there are errors in the keyword names (stratified vs. stratification), but the program runs as-is without exceptions. Also, I know the documentation shows values like stratified=StratifiedIfPossible, but for some reason only boolean values work for me.
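For context, here is my understanding of what stratification does, written as a plain-Python sketch rather than with Orange. The helper `stratified_folds` and the toy labels are hypothetical and only meant to illustrate the idea: stratified sampling keeps the class proportions the same in every fold, so it preserves the 30:1 ratio rather than rebalancing it. That might be part of why the scores look the same either way, but I'm not sure.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds, one class at a time."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Hypothetical toy data with the same 30:1 ratio of 1's to 0's.
labels = [1] * 3000 + [0] * 100
folds = stratified_folds(labels, k=5)

ratios = []
for fold in folds:
    n_ones = sum(labels[i] for i in fold)
    n_zeros = len(fold) - n_ones
    ratios.append(float(n_ones) / n_zeros)

print(ratios)
```

Every fold here ends up with 600 ones and 20 zeros, so the per-fold ratio is still 30:1. Actually changing the class balance would need a different step, such as undersampling the 1's or weighting the 0's.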