Data Preparation Concepts: What is Fuzzy Matching?Reading time: 5 minutes
Have you ever looked at a dataset and seen records that you know should match, but don’t?
It can be frustrating to sift through and correct each of the errors.
Never fear: Fuzzy Matching is here to help!
What is Fuzzy Matching?
Fuzzy matching is a data preparation technique used to unite records that should match but currently do not. The mismatch could be due to an error in data entry, a conflict in the software used to compile the documents, a merger of several documents that used different coding systems, or for any number of other reasons.
Here is an example: let’s say your database contains contact information for customers who live in a specific zip code. In this instance, you may want to group all of the entries by street name. In the Address 1 column, you might come across entries like:
- 165 Macademia Street
- 199 Macadamia Street
- 178 Macedemia Street
As the subject matter expert, you know that the street name in these entries should match, but since the first and third entries contain misspellings, each entry would appear in a distinct group when filtered by street name. Fuzzy Matching would recognize that these street names should match, then correct them.
In this simple example, it wouldn’t take long to correct the issue manually. But what if your dataset has thousands of entries and dozens of fields? You can see how allowing Fuzzy Matching to detect and correct entries automatically would save you lots of time.
How Does Fuzzy Matching Work?
Now that we know what fuzzy matching is let’s look at how it identifies and pairs similar entries.
Fuzzy matching uses algorithms that score differences between entries, associating entries that fall within a certain degree of difference from one another.
In the example above, we can intuit that Macademia is a misspelling of Macadamia, but with a fuzzy matching scoring system, we can quantify and correct for that difference.
There are several methods used to match entries. We’ll take a look at two of the most common here: The Damerau Levenshtein Score and the Jaro-Winkler Score.
The Damerau Levenshtein Score
Damerau Levenshtein scoring looks at the number of changes you’d need to make between two terms for them to match.
For a simple example, let’s look again at our street names, comparing “Macadamia Street” (the correct spelling) to “Macedemia Street” (which contains two incorrect characters).
Here’s how we’d break down the changes we need to make for these entries to match:
- Macedemia Street → Macademia Street (substituting one “e” for an “a”)
- Macademia Street → Macadamia Street (substituting another “e” for an “a”)
Counting the number of changes, the Damerau Levenshtein score for these entries is 2. If you set your threshold to 2, these entries would be matched, as would any misspellings that require just 1 character change to match. Setting the score to 2 would not catch any entries with 3 misspellings, as they surpass the threshold.
If you’re using an automated data preparation software (like ours) to fuzzily match your entries, decide on the desired number of points (i.e., changes) you want to permit. You’ll likely want to stick to 1 or 2 allowed changes, then double-check the data after fuzzy matching to fine-tune the results.
Setting the threshold at too many allowed changes could result in drastic matches you don’t want to permit. For example, Oak Street and Elm Street are only three character changes away. However, they are clearly distinct street names you would not want to match.
The Jaro-Winkler Score
The Jaro-Winkler score is a less intuitive scoring system, but it operates under the same general principle.
Jaro-Winkler assigns a score between 0 and 1 to indicate the degree of difference between two entries. In Construct (our data prep software), a score of 1 is a perfect match (the same characters in the same order).
The formula uses the number of matching characters, the length of character strings, and the number of transpositions to calculate a base similarity. It then adds to the score if there are character matches at the beginning (prefix) of a string of characters.
If you’re curious about the math behind the score, here is the formula Construct uses:
(⅓*(Matching Characters/Length of String1 + Matching Characters/Length of String2 + (Matching Characters – Transpositions)/Matching Characters)) + Prefix Length * Prefix Scale(1 – (⅓*(Matching Characters/Length of String1 + Matching Characters/Length of String2 + (Matching Characters – Transpositions)/Matching Characters))
Running the Macedamia/Macadamia example through this formula results in a Jaro-Winkler score of 0.895 (a close match). This knowledge base article goes into detail about how the example Jaro-Winkler score was calculated.
Ultimately, Jaro-Winkler carries out the same task as Damerau Levenshtein. It’s a mathematical identification of the degree of difference between two sets of characters. As with Damerau Levenshtein, you should trust your tools but also verify their results.
Fuzzy Matching for Non-Statisticians
If you work with data frequently but don’t have formal data training, our software platform can make your work much easier.
Construct, our code-free data preparation tool, uses a drag-and-drop, node-based interface to automatically organize, cleanse, and prepare your data. Construct features a Fuzzy Matching node that you can easily add to any data wrangling job.
Our Fuzzy Matching node calculates both of the scores discussed in this post, then allows you to match similar entries automatically. You can also carry out dozens of other data prep tasks, such as merging, transforming, and deduplicating data.
The result is a clean, trustworthy dataset produced by a saved, repeatable workflow that you can reuse with new data anytime you need it.
We also have a team of experienced data analysts on staff to support your work. So in the future, you won’t need to search google to learn about fuzzy matching because our analysts can assist you with all of your data needs.
If you’re interested in learning more about how our tools & support can improve your work with data, click the button below for a personalized walkthrough of our software!