
7 Data Cleanup Terms Explained Visually


During a recent data conference, a coworker and I realized that there is a bit of a language barrier between “data people” and “non-data people.” For us, this was most apparent when we tried to describe data cleanup operations. There are technical terms for each type of data cleanup, but we’ve found that many of them can be described just as easily in plain English. Even with both a technical and a plain-English definition in hand, though, it can still be hard to visualize what these operations actually do.

So we asked ourselves, “What does data cleanup look like?” in hopes that this might be another way to communicate the meaning of these terms in a less technical way. Below, we’ve created a visual glossary to answer that question and explain some commonly used data cleanup operations.

Aggregating

Aggregating is grouping data and then expressing each group in a summary form, such as a count, a sum, or an average.

[Image: aggregating]

Here we have a list of Nobel Peace Prize Winners that includes the country they hail from. Aggregating the list gave us the total number of Peace Prize Winners from each country.
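For readers who think in code, the same idea can be sketched in a few lines of Python. This is only an illustration: the five rows below are a small hand-picked sample rather than the list from the graphic, and the names `winners` and `winners_per_country` are our own.

```python
from collections import Counter

# Illustrative sample only; not the full list shown in the graphic.
winners = [
    {"name": "Martin Luther King Jr.", "country": "United States"},
    {"name": "Desmond Tutu", "country": "South Africa"},
    {"name": "Nelson Mandela", "country": "South Africa"},
    {"name": "Barack Obama", "country": "United States"},
    {"name": "Malala Yousafzai", "country": "Pakistan"},
]

# Group the rows by country and count how many rows fall into each group.
winners_per_country = Counter(row["country"] for row in winners)

# Print one summary line per country, largest group first.
for country, total in winners_per_country.most_common():
    print(country, total)
```

A count is just one kind of summary; a sum or an average per group would follow the same group-then-summarize pattern.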

Filtering

Filtering a dataset narrows it down to just a specific group of records:

[Image: filtering]

Here we have a list of elements and their element type. We’ve filtered the list down to just the metal elements.
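As a rough sketch, filtering might look like this in Python. The sample rows and the field names `element` and `type` are assumptions made for the example, not the exact contents of the graphic.

```python
# Illustrative sample of elements and their types.
elements = [
    {"element": "Iron", "type": "metal"},
    {"element": "Oxygen", "type": "nonmetal"},
    {"element": "Gold", "type": "metal"},
    {"element": "Silicon", "type": "metalloid"},
    {"element": "Sodium", "type": "metal"},
]

# Keep only the rows whose type is "metal"; the original list is left untouched.
metals = [row for row in elements if row["type"] == "metal"]

for row in metals:
    print(row["element"])
```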

Merging

When your data is scattered in multiple datasets, merging allows you to combine the relevant parts of those datasets to create a new file to work with.

[Image: merging]

Here we have two datasets: one contains a column for state and a column for state capital; the other contains a column for state and a column for state population. We’ve merged these datasets to create one dataset that contains state, state capital, and state population.
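A minimal Python sketch of a merge, assuming the two datasets share a `state` column to match on. The population numbers are placeholders invented for the example, not real census figures, and the variable names are our own.

```python
# Dataset 1: state and state capital.
capitals = [
    {"state": "Ohio", "capital": "Columbus"},
    {"state": "Texas", "capital": "Austin"},
    {"state": "Oregon", "capital": "Salem"},
]

# Dataset 2: state and population (placeholder numbers, not real figures).
populations = [
    {"state": "Texas", "population": 300},
    {"state": "Ohio", "population": 120},
    {"state": "Oregon", "population": 40},
]

# Build a lookup from state to population so rows can be matched by state.
population_by_state = {row["state"]: row["population"] for row in populations}

# Combine matching rows; states missing from either dataset are skipped.
merged = [
    {**row, "population": population_by_state[row["state"]]}
    for row in capitals
    if row["state"] in population_by_state
]

for row in merged:
    print(row["state"], row["capital"], row["population"])
```

Skipping unmatched states is one possible choice (often called an inner join); keeping them with a blank population would be another.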

Appending

To append two datasets is to stack them to create one larger dataset. Usually, when appending data, the datasets contain the same (or very similar) fields.

[Image: appending]

Here we have two datasets. One contains all of the superhero movies released in 2013. The other contains all of the superhero movies released in 2014. We’ve appended these datasets together to create one stacked dataset. The new dataset contains all of the superhero movies released in 2013 and 2014.
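In Python, appending can be as simple as joining two lists that share the same fields. The titles below are a partial sample for illustration, not the complete lists from the graphic, and the field names are our own.

```python
# Partial, illustrative lists; both use the same fields.
movies_2013 = [
    {"title": "Iron Man 3", "year": 2013},
    {"title": "Man of Steel", "year": 2013},
    {"title": "Thor: The Dark World", "year": 2013},
]
movies_2014 = [
    {"title": "Captain America: The Winter Soldier", "year": 2014},
    {"title": "Guardians of the Galaxy", "year": 2014},
    {"title": "X-Men: Days of Future Past", "year": 2014},
]

# Stack the second dataset underneath the first to get one longer dataset.
all_movies = movies_2013 + movies_2014

for row in all_movies:
    print(row["year"], row["title"])
```

Unlike a merge, nothing is matched up here: the row count of the result is simply the sum of the two inputs.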

Deduping

To dedupe is to remove duplicates from a dataset.

[Image: deduping]

Here we have a list of emails from our Mickey Mouse Club newsletter, which includes a couple of duplicates (highlighted). We’ve deduped this list so that we have one single entry for each person (or mouse, or duck, or dog) on our list.
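Here is one way deduping might be sketched in Python. The addresses are made up for the example, and treating addresses that differ only in capitalization or stray spaces as duplicates is our assumption, not a rule from the graphic.

```python
# Hypothetical newsletter list with a couple of repeated entries.
emails = [
    "mickey@example.com",
    "minnie@example.com",
    "Mickey@example.com ",
    "donald@example.com",
    "goofy@example.com",
    "donald@example.com",
]

# Assumption: trim whitespace and lowercase before comparing entries.
normalized = [email.strip().lower() for email in emails]

# dict.fromkeys keeps the first occurrence of each value, in original order.
deduped = list(dict.fromkeys(normalized))

for email in deduped:
    print(email)
```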

Transforming

To transform a column is to perform an operation on or with that column to produce a new result. The result could be an entirely new variable or a different version of the input column.

[Image: transforming]

Here we have a dataset that contains the first and last names of our earliest presidents. We’ve combined the two columns to create a brand new column, “Full Name.”
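A small Python sketch of this transformation follows. The field names `first_name`, `last_name`, and `full_name` are our own stand-ins for the columns in the graphic.

```python
# The first few presidents, stored as separate first and last name columns.
presidents = [
    {"first_name": "George", "last_name": "Washington"},
    {"first_name": "John", "last_name": "Adams"},
    {"first_name": "Thomas", "last_name": "Jefferson"},
]

# Derive a new "full_name" column from the two existing columns.
with_full_name = [
    {**row, "full_name": f"{row['first_name']} {row['last_name']}"}
    for row in presidents
]

for row in with_full_name:
    print(row["full_name"])
```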

Data Cleansing

To cleanse a column is to clean up the values within that column, commonly by replacing inconsistent entries with a single standard value.

[Image: data cleansing]

Here we have a gender column that we’ve pulled out of our database. We noticed that our entries for gender are not uniform. For example, “female”, “fem”, and “F” can all represent “female”. We’ve cleansed the data to make the entries more consistent within the column.
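One possible way to sketch this in Python is with a lookup table of known variants. The mapping below is an assumption for illustration; a real dataset would need its own list of variants, and values that are not in the table are left unchanged here so they can be reviewed by hand.

```python
# Hypothetical mapping from known variants to one standard value.
STANDARD_VALUES = {
    "female": "female",
    "fem": "female",
    "f": "female",
    "male": "male",
    "m": "male",
}

# Raw column values with inconsistent spelling and capitalization.
raw_gender = ["female", "fem", "F", "M", "male", "Male"]

# Look up each value (ignoring case and stray spaces); keep unknown values as-is.
cleansed_gender = [
    STANDARD_VALUES.get(value.strip().lower(), value) for value in raw_gender
]

print(cleansed_gender)
```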

Data Cleanup Made Easy

Our data cleanup software, Construct, makes all of these data cleanup tasks easy. Just drag and drop a cleanse, transform, append, or dedupe node into the visual workflow and let the software do the rest!

Interested in learning more? Click the button below to schedule a demo!

REQUEST A DEMO
