7 Data Cleanup Terms Explained Visually

During a recent data conference, a coworker and I realized that there is a bit of a language barrier between “data people” and “non-data people”. For us, this was most apparent when we tried to describe data cleanup operations. While there are technical terms for types of data cleanup, we’ve found that many of them can be described just as easily in plain English. Still, even with both a technical and a less-technical definition, it can be hard to visualize what these terms are actually doing.

So we asked ourselves, “What does data cleanup look like?”, hoping that pictures might be another, less technical way to communicate what these terms mean. Below, we’ve created a visual glossary to answer that question and explain some commonly used data cleanup operations.

Aggregating

Aggregating is grouping data and then expressing each group in a summary form, such as a count or a total.

[Image: Aggregate]

Here we have a list of Nobel Peace Prize winners that includes the country each winner hails from. We’ve aggregated the list to get the total number of Peace Prize winners from each country.
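For readers who want to see the same idea in code, here is a minimal sketch in plain Python. The six rows are a small illustrative sample, not the full list from the image, and the variable names are our own.

```python
from collections import Counter

# Illustrative sample rows (winner, country); not the full list from the image.
winners = [
    ("Martin Luther King Jr.", "United States"),
    ("Jimmy Carter", "United States"),
    ("Barack Obama", "United States"),
    ("Desmond Tutu", "South Africa"),
    ("Nelson Mandela", "South Africa"),
    ("Malala Yousafzai", "Pakistan"),
]

# Group the rows by country and count how many rows fall into each group.
winners_per_country = Counter(country for _, country in winners)

# Print the summary, largest group first.
for country, total in winners_per_country.most_common():
    print(country, total)
```

Six detail rows go in and three summary rows come out, one per country.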

Filtering

Filtering a dataset narrows it down to just a specific group of records:

[Image: Data Cleanup: Filter]

Here we have a list of elements and their element types, which we’ve filtered down to just the metal elements.
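A filter can be sketched in a few lines of plain Python. The rows and the field names (`element`, `type`) below are assumptions chosen for illustration rather than the exact contents of the image.

```python
# Illustrative rows; the field names are our own.
elements = [
    {"element": "Iron", "type": "Metal"},
    {"element": "Oxygen", "type": "Nonmetal"},
    {"element": "Copper", "type": "Metal"},
    {"element": "Silicon", "type": "Metalloid"},
    {"element": "Gold", "type": "Metal"},
]

# Keep only the records whose type is "Metal"; the original list is unchanged.
metals = [row for row in elements if row["type"] == "Metal"]

print([row["element"] for row in metals])
```

Filtering never alters a record. It only decides which records stay in the result.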

Merging

When your data is scattered in multiple datasets, merging allows you to combine the relevant parts of those datasets to create a new file to work with.

[Image: Merging]

Here we have two datasets: one contains a column for state and a column for state capital, and the other contains a column for state and a column for state population. Through data cleanup we’ve merged these datasets together to create one dataset that contains state, state capital, and state population.
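As a rough sketch, a merge matches rows from the two datasets on a shared key, here the state name. The population figures are approximate values in millions included only for illustration, and the field names are our own. This version keeps only states that appear in both datasets (often called an inner join), which is one of several possible merge behaviors.

```python
# Two illustrative datasets that share a "state" column.
capitals = [
    {"state": "California", "capital": "Sacramento"},
    {"state": "Texas", "capital": "Austin"},
    {"state": "New York", "capital": "Albany"},
]
populations = [
    {"state": "Texas", "population_millions": 29.1},       # approximate
    {"state": "California", "population_millions": 39.5},  # approximate
    {"state": "New York", "population_millions": 20.2},    # approximate
]

# Index the second dataset by the shared key so lookups are quick.
population_by_state = {
    row["state"]: row["population_millions"] for row in populations
}

# For each state with a match, combine the columns from both datasets.
merged = [
    {**row, "population_millions": population_by_state[row["state"]]}
    for row in capitals
    if row["state"] in population_by_state
]

for row in merged:
    print(row)
```

Note that the row order of the two inputs does not need to match, because rows are paired by key rather than by position.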

Appending

To append two datasets is to stack them to create one larger dataset. Usually, when appending data, the datasets contain the same (or very similar) fields.

[Image: Data Cleanup: Append]

Here we have two datasets. One contains all of the superhero movies released in 2013. The other contains all of the superhero movies released in 2014. We’ve appended these datasets together to create one stacked dataset. The new dataset contains all of the superhero movies released in 2013 and 2014.
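In code, appending can be as simple as concatenating two lists of rows that share the same fields. The sketch below uses a few titles from each year as a sample, not the complete lists, and the field names are assumptions.

```python
# Two illustrative datasets with identical fields.
movies_2013 = [
    {"title": "Iron Man 3", "year": 2013},
    {"title": "Man of Steel", "year": 2013},
    {"title": "Thor: The Dark World", "year": 2013},
]
movies_2014 = [
    {"title": "Captain America: The Winter Soldier", "year": 2014},
    {"title": "Guardians of the Galaxy", "year": 2014},
    {"title": "X-Men: Days of Future Past", "year": 2014},
]

# Stack the second dataset underneath the first.
all_movies = movies_2013 + movies_2014

print(len(all_movies), "rows after appending")
```

Unlike a merge, an append adds rows rather than columns, so the row count of the result is the sum of the two inputs.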

Deduping

To dedupe is to remove duplicates from a dataset.

[Image: Deduping]

Here we have a list of emails from our Mickey Mouse Club newsletter, which includes a couple of duplicates (highlighted). We’ve deduped this list so that we have one single entry for each person (or mouse, or duck, or dog) on our list.
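Here is one way deduping might look in plain Python. The addresses are made up, and the `dedupe` helper is hypothetical. It also assumes that two emails differing only in capitalization or surrounding spaces count as the same entry, which is a choice you would want to confirm for your own data.

```python
# Made-up newsletter list containing duplicates.
emails = [
    "mickey@example.com",
    "minnie@example.com",
    "Mickey@example.com",   # duplicate, different capitalization
    "donald@example.com",
    "goofy@example.com",
    "donald@example.com",   # exact duplicate
]


def dedupe(values):
    """Return values with duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for value in values:
        key = value.strip().lower()  # assumption: compare case-insensitively
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique


unique_emails = dedupe(emails)
print(unique_emails)
```

Keeping the first occurrence preserves the original order of the list, which makes the result easier to compare against the input.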

Transforming

To transform a column is to perform an operation on or using that column that produces a new result. This could be an entirely new variable or a different version of the input column.

[Image: Data Cleanup: Transform]

Here we have a dataset that contains the first and last names of our earliest presidents. We’ve combined the two columns to create a brand new column, “Full Name”.
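A minimal sketch of this transformation in plain Python follows. The field names (`first_name`, `last_name`, `full_name`) are our own choices for illustration.

```python
# Illustrative rows with separate first and last name columns.
presidents = [
    {"first_name": "George", "last_name": "Washington"},
    {"first_name": "John", "last_name": "Adams"},
    {"first_name": "Thomas", "last_name": "Jefferson"},
]

# Build a new column from the two existing columns.
for row in presidents:
    row["full_name"] = f"{row['first_name']} {row['last_name']}"

print([row["full_name"] for row in presidents])
```

The original columns are left in place, so the new column is an addition to the dataset and not a replacement.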

Data Cleansing

To cleanse a column is to clean up the values within that column, commonly by replacing them.

[Image: Cleansing]

Here we have a gender column that we’ve pulled out of our database. We noticed that our entries for gender are not uniform. For example, “female”, “fem”, and “F” can all represent “female”. We’ve cleansed the data to make the entries more consistent within the column.
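One simple way to sketch cleansing is a lookup table that maps each known variant to a standard value. The mapping and the `cleanse_gender` helper below are hypothetical. A real column would likely need more variants, and this sketch leaves any value it does not recognize unchanged so it can be reviewed by hand.

```python
# Hypothetical mapping from known variants (lowercased) to a standard value.
GENDER_MAP = {
    "female": "Female",
    "fem": "Female",
    "f": "Female",
    "male": "Male",
    "m": "Male",
}


def cleanse_gender(value):
    """Return the standard form of value, or value itself if unrecognized."""
    key = value.strip().lower()
    return GENDER_MAP.get(key, value)


raw_column = ["female", "fem", "F", "Male", "m", "M "]
cleaned_column = [cleanse_gender(value) for value in raw_column]

print(cleaned_column)
```

Keeping the mapping in one table makes it easy to add a new variant when one turns up in the data.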

Data Cleanup Resources

If you’re looking for further explanations of data, analytics, and modeling, I’d recommend checking out our video series, which outlines the modeling process from raw data through to reporting results — in 5-minute bite-sized chunks.
