I am generating some data whose plots are shown below.
In all the plots I get some outliers at the beginning and at the end. Currently I am truncating the first and the last 10 values. Is there a better way to handle this?
I am basically trying to automatically identify the two points shown below.
Best How To :
This is a fairly general problem with many possible approaches; usually you will use some a priori knowledge of the underlying system to make it tractable.
So for instance, if you expect to see the pattern above - a fast drop, a linear section (up or down) and a fast rise - you could try taking the derivative of the curve and looking for large values and/or sign reversals. It may help to bin (or otherwise smooth) the data first, since differentiating amplifies noise.
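As a rough illustration of the derivative idea, here is a minimal Python/NumPy sketch. The function name `find_linear_section`, the moving-average window `smooth`, and the multiplier `k` are all hypothetical choices for this example, not anything from the original question. It smooths the data, differentiates it, and returns the bounds of the longest stretch where the slope stays close to its median value.

```python
import numpy as np

def find_linear_section(y, smooth=5, k=5.0):
    """Return (start, end) indices bounding the longest slowly varying stretch.

    smooth and k are illustrative defaults (assumptions), to be tuned on real data.
    """
    y = np.asarray(y, dtype=float)
    # Moving average to suppress noise before differentiating.
    kernel = np.ones(smooth) / smooth
    ys = np.convolve(y, kernel, mode="valid")
    dy = np.diff(ys)
    # Robust centre and spread of the slope: median and median absolute deviation.
    med = np.median(dy)
    mad = np.median(np.abs(dy - med))
    # A slope is "calm" if it stays within k robust deviations of the typical slope.
    # The small floor guards against a zero MAD on noise-free data.
    calm = np.abs(dy - med) <= k * max(mad, 1e-9)
    # Locate runs of calm samples and keep the longest one.
    padded = np.concatenate(([False], calm, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)
    ends = np.flatnonzero(edges == -1)  # exclusive end of each run
    if starts.size == 0:
        return 0, len(y) - 1
    longest = np.argmax(ends - starts)
    # Shift back to indices of the original array (window centre).
    offset = smooth // 2
    return int(starts[longest]) + offset, int(ends[longest]) + offset

# Synthetic example: fast drop, gentle linear rise, fast rise.
y_demo = np.concatenate([
    np.linspace(10.0, 1.9, 10),
    1.0 + 0.01 * np.arange(80),
    np.linspace(1.8, 10.0, 10),
])
start, end = find_linear_section(y_demo)
print(start, end)
```

The smoothing window blurs each transition by a couple of samples, so the returned indices sit slightly inside the true linear section; whether that margin is acceptable depends on the data.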
If your pattern is not so easy to define but you are expecting a linear trend, you might fit the data to an appropriate class of curve using fit and then detect outliers as those points whose error from the fit exceeds a given threshold.
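The fit-and-threshold idea could be sketched as follows. This example swaps in NumPy's `np.polyfit` for whatever `fit` routine the answer had in mind, and the function name `residual_outliers`, the multiplier `k`, and the refitting loop are assumptions made for illustration. Because the end transients pull a plain least-squares line away from the trend, the sketch refits on the current inliers a few times.

```python
import numpy as np

def residual_outliers(x, y, degree=1, k=3.0, n_iter=5):
    """Flag points whose residual from a polynomial fit is unusually large.

    degree, k and n_iter are illustrative defaults (assumptions).
    Returns (outlier_mask, coeffs) with coeffs in np.polyfit order.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    inlier = np.ones(len(y), dtype=bool)
    for _ in range(n_iter):
        # Fit only the points currently believed to follow the trend.
        coeffs = np.polyfit(x[inlier], y[inlier], degree)
        resid = y - np.polyval(coeffs, x)
        # Robust centre and scale of the inlier residuals (MAD scaled to ~sigma).
        centre = np.median(resid[inlier])
        scale = 1.4826 * np.median(np.abs(resid[inlier] - centre))
        new_inlier = np.abs(resid - centre) <= k * max(scale, 1e-9)
        # Stop if nothing changed or too few points remain to fit.
        if np.array_equal(new_inlier, inlier) or new_inlier.sum() <= degree + 1:
            break
        inlier = new_inlier
    return ~inlier, coeffs

# Synthetic example: fast drop, gentle linear rise, fast rise.
x_demo = np.arange(100)
y_demo = np.concatenate([
    np.linspace(10.0, 1.9, 10),
    1.0 + 0.01 * np.arange(80),
    np.linspace(1.8, 10.0, 10),
])
outliers, coeffs = residual_outliers(x_demo, y_demo)
print(np.flatnonzero(outliers))
```

The first and last indices not flagged as outliers would then serve as the two cut points. The multiplier `k` is still a threshold that has to be chosen.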
In either case you still have to choose thresholds - mean, variance and higher order moments can help here but you would probably have to analyse existing data (your training set) to determine the values empirically.
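One simple way to derive such a threshold from existing data is to take the mean plus a few standard deviations of the error magnitudes seen on samples known to be clean. The sketch below assumes you already have such a training set; the name `empirical_threshold` and the default `k=3.0` are hypothetical.

```python
import numpy as np

def empirical_threshold(clean_errors, k=3.0):
    """Threshold = mean + k * std of absolute errors from known-good data.

    k is an illustrative default (assumption); tune it on your own data.
    """
    magnitudes = np.abs(np.asarray(clean_errors, dtype=float))
    return float(magnitudes.mean() + k * magnitudes.std())

# Errors measured on a segment known to be free of outliers (made-up values).
threshold = empirical_threshold([0.1, -0.1, 0.1, -0.1])
print(threshold)
```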
And perhaps, after all that, as Shai points out, you may find that lopping off the first and last ten points gives the best results for the time you spent (cf. Pareto principle).