7 Data Cleanup Terms Explained Visually
During a recent data conference, a coworker and I realized that there is a bit of a language barrier between “data people” and “non-data people”. For us, this was most apparent when we tried to describe data cleanup operations. There are technical terms for each type of data cleanup, but we’ve found that many of them can be described just as easily in plain English. Even with both a technical and a less technical definition in hand, it can still be hard to visualize what these operations actually do.
So we asked ourselves, “What does data cleanup look like?”, hoping that pictures might be another, less technical way to communicate what these terms mean. Below, we’ve created a visual glossary to answer that question and explain some commonly used data cleanup operations.
Aggregating is grouping data and then expressing each group in a summary form, such as a count, a sum, or an average.
Here we have a list of Nobel Peace Prize winners that includes the country each winner hails from. We’ve aggregated the list to get the total number of Peace Prize winners from each country.
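For readers who want to see the same idea in code, here is a minimal Python sketch of aggregating by country. The five rows are a small illustrative sample rather than the full list from the visual, and the variable names are our own.

```python
from collections import Counter

# Illustrative sample rows: (winner, country). Not the complete list of laureates.
winners = [
    ("Martin Luther King Jr.", "United States"),
    ("Nelson Mandela", "South Africa"),
    ("Desmond Tutu", "South Africa"),
    ("Barack Obama", "United States"),
    ("Malala Yousafzai", "Pakistan"),
]

# Group the rows by country and count how many rows fall in each group.
winners_per_country = Counter(country for _, country in winners)

for country, total in winners_per_country.items():
    print(country, total)
```

A count is only one kind of summary. The same grouping step could just as well feed a sum or an average.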
Filtering a dataset narrows it down to just a specific group of records:
Here we have a list of elements and their element types, which we’ve filtered down to just the metals.
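In code, filtering usually comes down to keeping only the rows that pass a test. The sketch below assumes each row is a small Python dictionary with hypothetical "element" and "type" fields, and uses just four sample elements.

```python
# Illustrative sample rows; the field names are assumptions for this sketch.
elements = [
    {"element": "Iron", "type": "Metal"},
    {"element": "Oxygen", "type": "Nonmetal"},
    {"element": "Gold", "type": "Metal"},
    {"element": "Carbon", "type": "Nonmetal"},
]

# Keep only the rows whose type is "Metal"; the original list is left unchanged.
metals = [row for row in elements if row["type"] == "Metal"]

print([row["element"] for row in metals])  # ['Iron', 'Gold']
```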
When your data is scattered in multiple datasets, merging allows you to combine the relevant parts of those datasets to create a new file to work with.
Here we have two datasets: one contains a column for state and a column for state capital, and the other contains a column for state and a column for state population. Through data cleanup, we’ve merged these datasets to create one dataset that contains state, state capital, and state population.
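A merge needs a shared column (a "key") to line the rows up, which in this example is the state. Here is a rough Python sketch of that matching step. The population numbers are made-up placeholders, not real figures, and the field names are our own.

```python
# Dataset 1: state and state capital.
capitals = [
    {"state": "Ohio", "capital": "Columbus"},
    {"state": "Texas", "capital": "Austin"},
]

# Dataset 2: state and population (placeholder values, NOT real populations).
populations = [
    {"state": "Texas", "population": 200},
    {"state": "Ohio", "population": 100},
]

# Build a lookup from the shared key (state) to population.
population_by_state = {row["state"]: row["population"] for row in populations}

# For every state that appears in both datasets, combine the columns into one row.
merged = [
    {**row, "population": population_by_state[row["state"]]}
    for row in capitals
    if row["state"] in population_by_state
]

print(merged)
```

This version keeps only the states found in both datasets. Keeping unmatched rows as well is a design choice you would make depending on the analysis.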
To append two datasets is to stack them to create one larger dataset. Usually, when appending data, the datasets contain the same (or very similar) fields.
Here we have two datasets. One contains all of the superhero movies released in 2013. The other contains all of the superhero movies released in 2014. We’ve appended these datasets together to create one stacked dataset. The new dataset contains all of the superhero movies released in 2013 and 2014.
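Appending is probably the simplest operation to express in code, because stacking two lists with the same fields is a single step. The sketch below uses a few sample titles rather than every release from those years.

```python
# Two datasets with the same fields (a sample of titles, not every release).
movies_2013 = [
    {"title": "Iron Man 3", "year": 2013},
    {"title": "Man of Steel", "year": 2013},
]
movies_2014 = [
    {"title": "Guardians of the Galaxy", "year": 2014},
    {"title": "Captain America: The Winter Soldier", "year": 2014},
]

# Stack the second dataset underneath the first to get one longer dataset.
all_movies = movies_2013 + movies_2014

print(len(all_movies))  # 4
```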
To dedupe is to remove duplicates from a dataset.
Here we have a list of emails from our Mickey Mouse Club newsletter, which includes a couple of duplicates (highlighted). We’ve deduped this list so that we have one single entry for each person (or mouse, or duck, or dog) on our list.
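As a small Python sketch of deduping, the snippet below keeps the first occurrence of each email and drops any later repeats. The addresses are invented for illustration, and it assumes that duplicates are exact matches (same spelling and capitalization).

```python
# Invented newsletter list with two duplicate entries.
emails = [
    "mickey@example.com",
    "minnie@example.com",
    "mickey@example.com",
    "donald@example.com",
    "minnie@example.com",
]

# dict.fromkeys keeps one key per distinct value, in first-seen order.
deduped = list(dict.fromkeys(emails))

print(deduped)  # ['mickey@example.com', 'minnie@example.com', 'donald@example.com']
```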
To transform a column is to perform an operation on or with that column that produces a new result. This could be an entirely new variable or a different version of the input column.
Here we have a dataset that contains the first and last names of our earliest presidents. We’ve combined the two columns to create a brand new column, “Full Name”.
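Here is one way this transformation might look in Python: combine the two name columns into a new one while leaving the originals in place. The field names "first_name", "last_name", and "full_name" are our own choices for this sketch.

```python
# First and last names in separate columns.
presidents = [
    {"first_name": "George", "last_name": "Washington"},
    {"first_name": "John", "last_name": "Adams"},
    {"first_name": "Thomas", "last_name": "Jefferson"},
]

# Create a brand new column from the two existing ones.
for row in presidents:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"

print([row["full_name"] for row in presidents])
# ['George Washington', 'John Adams', 'Thomas Jefferson']
```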
To cleanse a column is to clean up the values within that column, commonly by replacing them.
Here we have a gender column that we’ve pulled out of our database. We noticed that our entries for gender are not uniform. For example, “female”, “fem”, and “F” can all represent “female”. We’ve cleansed the data to make the entries more consistent within the column.
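Cleansing by replacement can be sketched as a lookup table that maps each messy spelling to one consistent value. The "female", "fem", and "F" variants come from the example above; the male variants and the final labels are assumptions added to round out the illustration.

```python
# Messy values as they might come out of a database.
raw_gender = ["female", "fem", "F", "male", "M", "Female"]

# Map each known variant (lowercased) to one consistent label.
# The "male" / "m" entries are assumed variants, not taken from the article.
replacements = {
    "female": "Female",
    "fem": "Female",
    "f": "Female",
    "male": "Male",
    "m": "Male",
}

# Normalize case and whitespace before the lookup; unknown values pass through untouched.
cleansed = [replacements.get(value.strip().lower(), value) for value in raw_gender]

print(cleansed)  # ['Female', 'Female', 'Female', 'Male', 'Male', 'Female']
```

Letting unknown values pass through unchanged, rather than guessing, makes it easy to spot spellings the lookup table does not cover yet.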
Data Cleanup Resources
If you’re looking for further explanations of data, analytics, and modeling, I’d recommend checking out our video series, which outlines the modeling process from raw data through to reporting results — in 5-minute bite-sized chunks.