Outliers are the really annoying data points that all researchers hope they won’t have in their data, although they would be lucky to manage this. And why is this? Because they’re basically just a pain and can threaten the validity of our data if treated incorrectly, or not at all.
Before going into the details of why it is important to pay attention to outliers, let’s first look at what one is. The basic definition of an outlier is that it is an extreme value that does not follow the norm, or the pattern of the majority of our data. For example, the graph shown below shows a set of data which has a strong correlation or pattern, but one data point is positioned away from the rest of the points, so it is the outlier.
To start with, the good news about outliers is that sometimes they can be helpful (I know this is very hard to believe). Once outliers have been identified, they can be looked at more closely and can lead to some unexpected knowledge, showing more about individuals that do not fit the ‘norm’. They can also reveal errors within the research model. For example, if there is a measurement error, such as the data being recorded incorrectly, or a participant has not understood the instructions, then these are possible causes of outliers that the researcher could modify their study to exclude in the future.
However, now we come to the bad news. Outliers are more often than not seen as a problem rather than a help. Not only can they suggest that some of the data was taken from a different population than the intended one, thus affecting external validity, but they can also cause problems with analysis. An outlier can distort results, such as by dragging the mean in a certain direction, and can lead to faulty conclusions being made.
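To see how much a single extreme value can drag the mean, here is a minimal Python sketch. The scores are made up purely for illustration and do not come from any real study.

```python
from statistics import mean, median

# Hypothetical scores, invented for this illustration.
scores = [12, 14, 15, 15, 16, 17, 18]
with_outlier = scores + [95]  # the same data plus one extreme value

mean_clean = mean(scores)
mean_outlier = mean(with_outlier)
median_clean = median(scores)
median_outlier = median(with_outlier)

print("mean without / with outlier:", mean_clean, mean_outlier)
print("median without / with outlier:", median_clean, median_outlier)
```

With these numbers the mean jumps from about 15.3 to 25.25, a value that none of the original scores comes close to, while the median only moves from 15 to 15.5.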
Detecting these outliers can be a very complicated task and can sometimes require assumptions to be made about the parameters of the data and the expected distribution. For example, outliers are often detected as values that lie outside a certain range, but how is this range calculated? Sometimes it is based on the expected standard deviation, and sometimes on the interquartile range, by allowing a multiple of it below the lower quartile and above the upper quartile. A box plot displays this second method visually. Other visual aids, such as a scatter graph, as shown earlier, can also help identify outliers. Outliers are particularly problematic in categorical data, where they are more complicated to deal with. I won’t go into that here, but this article explains more about one method of dealing with them called wavelet transforms.
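The interquartile range approach can be sketched in a few lines of Python. This is only an illustration: the function name `iqr_outliers` and the data are my own, and the multiplier of 1.5 is a common convention rather than a fixed rule, so treat it as an assumption you can change.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles.

    k=1.5 is a conventional choice, not a fixed rule.
    """
    q1, _, q3 = quantiles(values, n=4)  # lower quartile, median, upper quartile
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical data with one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 95]
flagged = iqr_outliers(data)
print(flagged)
```

Different software calculates quartiles in slightly different ways, so the exact fences may vary a little between packages, although a value as extreme as the one here would be flagged either way.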
Now it’s time to brace yourselves as we come to the ugly part of working out what to do with these outliers. There are many methods of doing this, which include both leaving them in the data and taking them out. An important thing to do before modifying your research is to work out how influential a data point is on the data as a whole. This can be done by using the adjusted predicted value. This involves removing the suspected point from the data, fitting a new model to the remaining data, and using that model to predict what the removed data point should be. If the difference between this predicted value and the actual value is large, then the point is influential and can affect your results. For more info on this and other methods please see ‘Discovering Statistics using SPSS’ by Andy Field, chapter 5.
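Here is a rough sketch of the adjusted predicted value idea in Python, assuming the model is a simple straight-line regression with one predictor. The helper names (`fit_line`, `adjusted_predicted_value`) and the data are hypothetical and are not taken from Field's book or from SPSS.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def adjusted_predicted_value(xs, ys, i):
    """Predict point i from a model fitted to every point except point i."""
    rest_x = xs[:i] + xs[i + 1:]
    rest_y = ys[:i] + ys[i + 1:]
    slope, intercept = fit_line(rest_x, rest_y)
    return slope * xs[i] + intercept

# Hypothetical data: the first five points follow y = 2x, the last does not.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]

# Gap between each actual value and its adjusted predicted value.
gaps = [ys[i] - adjusted_predicted_value(xs, ys, i) for i in range(len(xs))]
most_influential = max(range(len(xs)), key=lambda i: abs(gaps[i]))
print(gaps)
print("largest gap at index", most_influential)
```

In this made-up example the last point has by far the largest gap, which is what you would expect from a point that sits well away from the line the other points follow.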
Extreme values can be set aside by some methods of analysis, such as when using z-scores. This involves using only the middle 95% or 99% of the data (depending on the researcher’s preference), which the ‘normal population’ should fall within. In a normal distribution this corresponds to z-scores within about ±1.96 or ±2.58 respectively. Any extreme values should therefore fall in the remaining percentage that is excluded from the calculations, so they will have little impact on the results. Another way in which the impact of outliers can be reduced is by using accommodation, or robust methods. These use non-parametric statistics, such as the median, and can use simple estimation tasks, so that the impact of extreme values is lessened. Transformation is another method that can be used to work with outliers. This involves taking the logs of all of the data and using these instead of the raw data, as this reduces positive skew by pulling large values in towards the rest. If none of these methods can appropriately be used to accommodate outliers, then it could be justifiable to remove them from the data altogether, but this should only be done as a last resort.
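Two of these ideas, the z-score cutoff and the log transformation, can be sketched in Python as follows. The data is made up, the 1.96 cutoff assumes you want the middle 95% of a roughly normal distribution, and the function name `within_z` is my own.

```python
import math
from statistics import mean, median, stdev

def within_z(values, cutoff=1.96):
    """Keep only the values whose z-score lies within +/- cutoff."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return [v for v in values if abs((v - m) / s) <= cutoff]

# Hypothetical data with one extreme value.
data = [12, 14, 15, 15, 16, 17, 18, 95]

kept = within_z(data)
print("kept after z-score cutoff:", kept)

# Log transformation: only defined for positive values.
logged = [math.log(v) for v in data]

# The gap between mean and median is a rough sign of skew.
raw_gap = mean(data) - median(data)
log_gap = mean(logged) - median(logged)
print("mean minus median, raw:", raw_gap)
print("mean minus median, logged:", log_gap)
```

One thing worth noticing is that an extreme value inflates the very mean and standard deviation used to calculate its own z-score, so in small samples a stricter cutoff such as 2.58 can fail to flag it. With the numbers above, the extreme value is excluded at 1.96 but would be kept at 2.58.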
So now that we’ve looked at identifying and dealing with outliers, I think we can safely say it’s okay to really hate them and start panicking about the prospect of having these in our dataset. However, I hope I have also made it clear that just because your data has outliers, it does not mean it’s the end of the world. They can be managed, and provided they are dealt with correctly, your data can still provide valid results.
For more information about the whole process of dealing with outliers see this article here.