How to Clean Data: Our Top 10 Tips!
In data analysis, the saying goes that cleaning data takes 80% of your time, leaving only 20% to actually work with it. But that 80/20 split isn’t set in stone. It all comes down to how efficiently you clean your data.
There are many ways to streamline your data cleaning efforts, helping you speed up your cleaning and make the most of your analysis.
Read on for Rapid Insight Product Manager Jon MacMillan’s top 10 data cleaning tips!
1. Consult An Expert on Your Data
If you don’t know the data well, ask someone who does! The same data might be stored in multiple places or in different formats. There could be a reason for this, so if someone at your organization has in-depth knowledge of the dataset, ask them for clarification before undertaking time-consuming steps that you may have to walk back later.
2. Inspect Min/Max Values
A quick way to spot outliers and typos is to look at min, max, and average values. For instance, an “age” variable might show a max value of 650… which is obviously outside the human lifespan! Look for values that stand out as unusual, then correct them.
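As a quick sketch of this check (using pandas and a made-up age column), a min/max/mean summary surfaces the impossible value immediately, and one option is to blank it out for later review:

```python
import pandas as pd

# Hypothetical data with a typo: an age of 650 is outside the human lifespan.
df = pd.DataFrame({"age": [23.0, 45.0, 650.0, 31.0, 52.0]})

# Min, max, and mean in one pass; the max of 650 flags the outlier.
summary = df["age"].agg(["min", "max", "mean"])
print(summary)

# One simple fix: treat impossible values as missing for later review.
df.loc[df["age"] > 120, "age"] = float("nan")
```

The 120-year cutoff here is an illustrative assumption; the right threshold depends on your domain.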
3. Examine Categorical Values
Categorical fields can be frustrating since they are often open-entry text fields (for example, a Month field that lets users enter the month instead of a drop-down to select the month).
Rather than going through the individual entries, start by taking a distinct count of a categorical field to see how many unique entries exist within that column. If Month has 19 instead of 12 distinct values, you likely have either some variation of abbreviations and month names, or you have other issues, like typos. This quick assessment will let you know if you need to spend time cleaning a field.
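A minimal sketch of the distinct-count check, with a hypothetical messy Month column containing abbreviations, case differences, and a typo:

```python
import pandas as pd

# Hypothetical open-entry Month field: abbreviations, typos, mixed case.
months = pd.Series(["January", "Jan", "jan", "February", "Febuary", "Feb"])

# Six distinct values for what should be only two months signals a problem.
print(months.nunique())

# Normalizing case and mapping known variants shrinks the count.
fixes = {"january": "January", "jan": "January",
         "february": "February", "feb": "February", "febuary": "February"}
cleaned = months.str.lower().map(fixes)
print(cleaned.nunique())
```

The variant map is an assumption you would build from inspecting the actual distinct values in your field.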
4. Double-check Coded Data
Codes become stale or obsolete and may need to be updated or standardized. Look out for values coded in multiple ways. For instance, “NY” and “New York” can be combined into one code. Be aware of codes that are subsets of other codes.
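Combining variant codes like “NY” and “New York” can be sketched as a mapping onto one canonical code (the state list here is hypothetical):

```python
import pandas as pd

# Hypothetical state codes stored inconsistently across records.
states = pd.Series(["NY", "New York", "N.Y.", "CA", "California"])

# Map every known variant onto a single canonical code.
canonical = {"New York": "NY", "N.Y.": "NY", "California": "CA"}
standardized = states.replace(canonical)
print(standardized.unique())
```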
5. Avoid Duplicates
Duplicate records give those data points extra weight in a model, skewing its predictions. For instance, you might have multiple records for a donor, one for each gift in their giving history. Rather than retaining each record, you may want to summarize this data by looking at total gift amount, average gift amount, total number of gifts, average time between gifts, and so on. Unless your model calls for it, it’s best to remove duplicates.
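One way to sketch that summarization, assuming a hypothetical gift-level table with one row per gift, is a group-by that collapses each donor to a single row:

```python
import pandas as pd

# Hypothetical giving history: one row per gift, multiple rows per donor.
gifts = pd.DataFrame({
    "donor_id": [1, 1, 1, 2, 2],
    "amount":   [100.0, 50.0, 150.0, 25.0, 75.0],
})

# Collapse to one row per donor with summary statistics.
summary = gifts.groupby("donor_id")["amount"].agg(
    total_given="sum", average_gift="mean", gift_count="count"
).reset_index()
print(summary)
```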
6. Make Best Use of Your Date Fields
You likely want to retain date information regardless of the analysis you are performing, but date values are often even more useful in calculations. It is helpful to know a donor’s initial gift date, but you can also calculate the number of years since their first donation. These calculations are especially useful if the intention is to use this information in a predictive model.
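The “years since first donation” calculation can be sketched as a date difference against a fixed as-of date (the dates and donors below are made up):

```python
import pandas as pd

# Hypothetical donor table with each donor's initial gift date.
donors = pd.DataFrame({
    "donor_id": [1, 2],
    "first_gift_date": pd.to_datetime(["2015-06-01", "2020-01-15"]),
})

# Derive years of giving relative to a fixed "as of" date,
# so the result is reproducible rather than shifting with today's date.
as_of = pd.Timestamp("2024-06-01")
donors["years_giving"] = (as_of - donors["first_gift_date"]).dt.days / 365.25
print(donors)
```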
7. Look for Missing Values
Fill in missing values when possible, but first consider why a value is missing. Sometimes it is best to leave the field blank, such as a GPA field in a student record: you wouldn’t want to assign a student a 0.0 GPA just because the value is absent. In other cases, you can impute a missing value. In a marketing or donor model, for example, you might calculate an income or credit score for a loan analysis to fill in a missing cell.
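As a minimal sketch of imputation, assuming a hypothetical income column, one common option is median fill plus a flag recording that the value was imputed:

```python
import pandas as pd

# Hypothetical income column with gaps; median imputation is one option.
df = pd.DataFrame({"income": [40000.0, None, 55000.0, 62000.0, None]})

median_income = df["income"].median()
df["income_imputed"] = df["income"].fillna(median_income)

# Keep a flag so a downstream model can see which values were imputed.
df["income_was_missing"] = df["income"].isna()
print(df)
```

Whether to impute at all is the judgment call the tip describes; a GPA-style field would instead stay blank.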
8. Calculate Useful Variables
Let’s say you have a list of donor addresses. At face value, these tell you how much individuals from that zip code donate, which may or may not be actionable information. But perhaps a donor’s distance from your nearest facility provides more useful information. That’s easy to calculate by comparing donor addresses to the nearest facility address and logging that in a new column of your dataset. Think about your data holistically to get to any underlying relationships. These may be more valuable than what the data offer at first glance.
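Assuming addresses have already been geocoded to coordinates (a step not shown here), the distance-to-facility calculation can be sketched with the haversine great-circle formula; all coordinates below are illustrative:

```python
import math
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Hypothetical donor coordinates (geocoded from addresses beforehand).
donors = pd.DataFrame({
    "donor_id": [1, 2],
    "lat": [40.7128, 42.3601],   # roughly New York and Boston
    "lon": [-74.0060, -71.0589],
})
facility = (40.7306, -73.9352)   # hypothetical nearest-facility location

# Log the derived distance in a new column of the dataset.
donors["miles_to_facility"] = donors.apply(
    lambda row: haversine_miles(row["lat"], row["lon"], *facility), axis=1
)
print(donors)
```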
9. Make Your Data Easier to Understand
Ultimately, your data needs to be interpretable by end-users, so it’s worth the time to review it and translate coded fields into meaningful phrases. For example, if a binary outcome like a Yes or No question is coded as 0=Yes and 1=No, it will be easier to interpret the data if you replace 0’s with Yes and 1’s with No.
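The 0/1-to-Yes/No translation from the example above is a one-line mapping:

```python
import pandas as pd

# Hypothetical coded outcome, using the article's convention: 0 = Yes, 1 = No.
responses = pd.Series([0, 1, 1, 0, 0])

labeled = responses.map({0: "Yes", 1: "No"})
print(labeled.tolist())
```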
10. Profile Your Data
After cleaning your data, look at aggregations to make sure they line up with expected totals. For instance, if you know your retention rate, take an average of a binary retained field to ensure it matches your expectations. If not, that’s a clue that you need to dig in and find out why.
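The retention-rate check can be sketched as the mean of a binary retained flag compared against a known figure (both the flag values and the expected rate below are hypothetical):

```python
import pandas as pd

# Hypothetical binary retained flag (1 = retained, 0 = not retained).
df = pd.DataFrame({"retained": [1, 1, 0, 1, 0, 1, 1, 1]})

# The mean of a 0/1 field is the retention rate.
observed_rate = df["retained"].mean()
print(f"Retention rate: {observed_rate:.0%}")

# Compare against the rate you already know from other reporting;
# a mismatch is the clue that something upstream needs a closer look.
expected_rate = 0.75  # hypothetical known figure
assert abs(observed_rate - expected_rate) < 0.05, "Investigate the discrepancy"
```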
How to Clean Data: Tools That Make It Possible
In addition to a solid understanding of your process, intuitive data tools can significantly improve your data cleaning efforts.
Construct, Rapid Insight’s data prep solution, fast-tracks your data cleansing with a code-free, self-service solution that brings accuracy and repeatability to your work. You can automate much of your data prep process, giving you more time to focus on discovering insights.
The software uses a visual workflow and identifies duplicates, calculates new variables, and makes data cleansing a breeze. It also logs and stores your steps to make the process easy to repeat and iterate. Copy the work you’ve done and reuse it for another analysis. Make experimental changes in “sandbox” mode to add to your analysis without impacting your primary data process.
Are you interested in learning more? Schedule a demo below!