Detection and Correction: Four Ways to Find Data ErrorsReading time: 3 minutes
Errors are the norm rather than the exception when working with data. You’ve probably heard the statistic that 88% of spreadsheets contain errors. Since we cannot assume that any of the data we work with is error-free, our mission is to find and tackle errors in the most efficient way possible. Everyone has their own way of catching errors, but here are a few quick strategies that have helped me over the years:
4 Ways to Find Data Errors
Method 1: Gauge min and max values
Method 2: Look for missings
Method 3: Check the values of categorical variables
Method 4: Look at the ‘incidence rate’ of binary variables
For continuous variables, checking the minimum and maximum values for each column can give you a quick idea of whether your values are falling within the correct range. For example, with a variable like age, I want to make sure that my minimum value makes sense, and that I’m not maxing out somewhere in the 400’s. Checking min and max values is a great way to spot extra zeroes or missing digits, and fix them before they enter an analysis.
Look for missings
The easiest way to find missings is to perform a count, if you have this function available. If not, there are other ways to find missing values. Try sorting your columns (both ‘ascending’ and ‘descending’) to see if any missing values exist in your columns, or filtering your dataset such that you’re only looking at records with a missing value. While sometimes missing values are inevitably due to chance, it’s worth double-checking to see if there might be an underlying reason for missingness, and address them as best you can.
Check the values of categorical variables
Depending on your methodology and the number of people contributing to a database, there can be lots of room for error when entering data. One quick way to find these is to pull up all of the different categories that a categorical variable can take on. A quick example: recently I was looking at a field set to store U.S. states. The values in the upper part of the spreadsheet looked good – ‘NH’, ‘ME’, etc. – but when I got to the lower part, the entry method had switched to more full values – ‘New Hampshire’ and ‘Maine’. Correcting the categories was a quick fix, but it’s best to catch these errors as early on as possible.
Look at the ‘incidence rate’ of binary variables
If we think of a true binary variable as one made up of 1’s and 0’s, looking at its mean (or the incidence rate) will tell you the proportion of 1’s you have in your dataset. It’s worth double-checking this to make sure that your binary is set up correctly. One common mistake I’ve seen is to have 1’s and nulls, rather than 1’s and 0’s. This becomes easy to spot because the “rate” of the binary variable will be equal to 1. The proportion of 1’s you get should make sense for the behavior you are trying to flag within your dataset.
Catching errors early on in the process is important so that early mistakes don’t influence decisions later on in the analysis. After all, our analysis is only as good as the data we use to create it. These are some of my most efficient mistake-catching techniques, but I’m sure there are others. If you have suggestions of your own, leave them in the comments!
If you’re looking for a way to catch and avoid these errors without having to manually assess your data, you might be interested in our data prep software, Construct! With a visual builder and automated jobs, it makes data cleaning easy. Click the link below if you’d like a demo.