Detection and Correction: Data Prep PitfallsReading time: 4 minutes
Seven challenges to be aware of when exploring your data
According to an article written by Northeastern University, the total amount of data in the world was 4.4 zettabytes in 2013 (That’s 4,400,000,000,000 gigabytes!!). We are set to reach 44 zettabytes by 2020. It sounds like I am making these words up, but to put this more plainly, according to Forbes, 90% of the data in the world today was created in the last two years alone, meaning that the growth of data continues to accelerate at breakneck speeds.
When it comes to data prep, knowledge of the data itself is the most important tool you can arm yourself with. With this knowledge, you can then identify what relevant data you need to access. Bringing all of the different information together can be a strenuous process and leaves us with many challenges when preparing this data for reporting or predictive modeling. In another blog entry, we discussed ways to identify data errors. In this post, I’ve compiled a list of seven common challenges that you should be aware of.
1. Monotonic Variables
A monotonic variable is one that increases without bound. This would be a variable such as a date, social security numbers, customer ID, etc. It is the transformation of these variables that allow us to glean some information about the individuals in our data set. For instance, if we have a customer’s last interaction date we could create a new variable representing the number of days since their last interaction. With customer ID’s simply adding a text component to these variables would ensure that this variable is not later confused as a continuous variable.
An outlier is a single, or very low frequency, occurrence of a value of a variable that is far away from the bulk of the values of that variable. As a general rule of thumb, if you can positively identify it as a mistake then you can try to correct the entry or just include it as a missing value. However, the data analyst must assume that all outliers are in fact outliers until they are positively identified as an error.
The Visual Analysis feature in Predict enables you to quickly see relationships within your data – including outliers.
If a variable is sparsely populated with entries if it is at all possible to collapse the variable that might be the best solution, which would include binning or other forms of cleansing. Another way to handle these would be to consider creating a dummy variable that marks whether or not the information exists at all. If, for instance, you had in your dataset emails of all individuals, but it happened to be sparsely populated, just the fact that an email exists might inform you of the type of relationship you have with that individual.
4. Data Consistency
When working with datasets it is important to review the data for consistency. Many manually entered text variables contain many unique values that in fact represent a single entry. It is important to cleanse these variations so that they are represented as a single entry. If working with Binary variables you want to make sure that they are represented uniformly.
The Cleanse Node in Construct makes fast work of cleaning up inconsistent data.
5. Data Pollution
One of the most common forms of data pollution is when users stretch a system beyond its means. For instance, in Higher Ed, instead of having a separate column to record ACT scores they were entered in the SAT column. Parsing this data out is essential.
6. Zip Codes
Zipcodes are constantly an issue when preparing data. The relationship between zip code and other variables cannot be easily identified. In Rapid Insight Software there is an easy way to calculate the distance between zip codes to return a value that we can then use.
7. Missing Values
First, it is important to differentiate, if possible, between values that are missing and those that are empty. Missing values are values that were not entered, but most certainly exist. Something like gender could be considered missing. Empty variables are values that do not actually exist.
The real question is how do we replace the missing values. Unfortunately, there isn’t a single catch-all answer here. This is entirely dependent on the variable and your knowledge of it. You could fill with the variable mean, fill with zero, or exclude the entire record. Also, there are several methods that can be used to estimate an appropriate value to plugin for one that is missing. You need to identify which are the important relationships to preserve and that will indicate how to find the estimate for those missing values. The key here is that when replacing the missing values we do not introduce a pattern that doesn’t already exist in the natural data.
Although these challenges always chance being present in data clean up, there are tools to ease this pain that make the process smoother, faster, and way more fun! Sometimes finding the right software for the job can be a challenge in itself. Construct makes that job a piece of cake. It’s our data blending and preparation tool with a user-friendly, drag and drop interface. With Construct, users can cut their data clean up down to a fraction of the time and then automate the process for convenience. Don’t let your Big Data mess get the best of you!