I have a dataset in which there are some outliers due to input errors.

I have written a function to remove these outliers from my data frame (source):

```
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}
```

Once I remove these outliers, the data set is modified. When I check it again, a new set of outliers shows up in some cases.

Is there a one-step method that removes all possible outliers?

# Best How To :

I believe that "outlier" is a very dangerous and misleading term. In many cases it means a data point which should be excluded from analysis for a specific reason. Such a reason could be that a value is beyond physical boundaries because of a measurement error, but not that "it does not fit the other points around it".

Here, you specify a statistical criterion based on the distribution of the actual data. Leaving aside that I don't find that approach appropriate here (because these data are presumably precisely measured for a given car), when you apply `remove_outliers` to the data, the function will determine the outlier limits and set the data points beyond these limits to `NA`.

```
## Using only column horsepower
dat <- read.csv("./cars.csv")
hp <- dat$Horsepower
## Calculates the boundaries like remove_outliers
calc.limits <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  lwr <- qnt[1] - H
  upr <- qnt[2] + H
  c(lwr, upr)
}
> calc.limits(hp)
25% 75%
-1.5 202.5
```

This results in a new data set with `NA` values. When you apply `remove_outliers` to the already reduced data set, the statistics will differ, and so will the limits. Thus, you will get "new" outliers (see Roland's comment).

```
hp2 <- remove_outliers(hp)
> calc.limits(hp2)
25% 75%
9 185
```

You can visualize this fact:

```
plot(hp, ylim = c(0, 250), las = 1)
abline(h = calc.limits(hp))
abline(h = calc.limits(hp2), lty = 3)
```

The solid lines indicate the limits for the original data, the dotted lines those for the already reduced data. First, you lose 10 data points, and then another 7.

```
> sum(is.na(hp2))
[1] 10
> sum(is.na(remove_outliers(hp2)))
[1] 17
```
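To see how far this shrinking can go, one could reapply the rule until no further value is flagged. The following is only a sketch under assumptions: `remove_outliers` is copied from the question, while the helper `remove_outliers_repeated`, its `max_iter` argument, and the synthetic vector `x` are hypothetical and not part of the original data or code.

```r
# remove_outliers as defined in the question (1.5 * IQR rule)
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Hypothetical helper: reapply the rule until a pass flags nothing new
# (or until max_iter passes have been applied)
remove_outliers_repeated <- function(x, max_iter = 100) {
  passes <- 0
  repeat {
    y <- remove_outliers(x)
    n_new <- sum(is.na(y)) - sum(is.na(x))
    if (n_new == 0 || passes >= max_iter) break
    x <- y
    passes <- passes + 1
  }
  list(values = x, passes = passes)
}

set.seed(1)
# Synthetic example data (an assumption, not cars.csv)
x <- c(rnorm(200, mean = 100, sd = 15), 400, 500)
res <- remove_outliers_repeated(x)
res$passes              # number of passes that actually removed something
sum(is.na(res$values))  # total number of values set to NA
```

That the loop stops only means the limits no longer move. It does not make any of the removed points wrong, so this is an illustration of the effect rather than a recommendation.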

In conclusion, if you don't have a good reason to remove a data point, just don't do it.
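If there is a good reason, such as values that are impossible for the measured quantity, the limits can be fixed in advance instead of being estimated from the data. A minimal sketch, assuming made-up values and made-up plausibility bounds; the helper `remove_implausible`, the vector `hp_example`, and the bounds 20 and 1000 are hypothetical and would have to come from domain knowledge:

```r
# Hypothetical helper: set values outside fixed, externally justified limits to NA
remove_implausible <- function(x, lower, upper) {
  y <- x
  y[!is.na(x) & (x < lower | x > upper)] <- NA
  y
}

# Made-up values; 0 and 9999 mimic input errors
hp_example <- c(95, 110, 150, 0, 88, 9999, 130)

# Assumed bounds, chosen only for illustration
hp_clean <- remove_implausible(hp_example, lower = 20, upper = 1000)

# A second pass changes nothing, because the limits do not depend on the data
identical(remove_implausible(hp_clean, lower = 20, upper = 1000), hp_clean)
```

Because the limits are not recomputed from the reduced data, this is a one-step procedure: no "new" outliers can appear on a second pass.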