
Fuzzy Matching: The Jaro-Winkler Score

Fuzzy matching is a data preparation technique used to unite records that should match but currently do not. The Jaro-Winkler score is one of the formulas Construct’s fuzzy matching node uses to associate similar records.

Jaro-Winkler assigns a score between 0 and 1 to indicate the degree of similarity between two entries. In Construct, a score of 1 is a perfect match (the same characters in the same order), and a score of 0 means the entries have no matching characters.

Construct uses the following formula to calculate a Jaro score:

⅓*(Matching Characters/Length of String1 + Matching Characters/Length of String2 + (Matching Characters – Transpositions)/Matching Characters)

To make this simpler (or more complex, depending on how you view things), let’s assign variables to these values, where:

  • S1 and S2 are the lengths of strings 1 and 2
  • M is the number of matching characters
  • T is the number of transpositions

With these variables the formula would be as follows:

⅓*(M/S1 + M/S2 + (M – T)/M)

Now let’s see what happens when we run an example through the Jaro score:

Record 1: Macadamia

Record 2: Macedemia

Matching Characters (M) = 7

Length of String 1 (S1) = 9

Length of String 2 (S2) = 9

Transpositions (T) = 0

This leaves us with:

⅓*(7/9 + 7/9 + (7 – 0)/7) ≈ 0.85
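The arithmetic above can be checked with a minimal Python sketch of the Jaro formula. It takes M, S1, S2, and T as inputs rather than deriving them from the strings, so it only reproduces the formula, not the character matching itself. The function name jaro_from_counts is illustrative and is not part of Construct, and returning 0 when there are no matching characters is an assumption made to avoid dividing by zero.

```python
def jaro_from_counts(m, s1_len, s2_len, t):
    """Jaro score from precomputed counts: 1/3 * (M/S1 + M/S2 + (M - T)/M)."""
    if m == 0:
        # Assumption: with no matching characters, treat the score as 0
        # instead of dividing by zero in the (M - T)/M term.
        return 0.0
    return (m / s1_len + m / s2_len + (m - t) / m) / 3


# Macadamia vs. Macedemia, using the counts from the example above.
score = jaro_from_counts(m=7, s1_len=9, s2_len=9, t=0)
print(round(score, 2))  # 0.85
```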

Now, 0.85 is a relatively high score, but you may have noticed that we have so far forgotten about Winkler. The Jaro-Winkler score adds a prefix scale (p), which puts more emphasis on the beginning of each string. If two strings begin with the same characters, Jaro-Winkler grants them a higher score than it would if those matching characters came later in the string.

With the addition of this prefix scale the formula is as follows:

Jaro Score + Prefix Length * Prefix Scale * (1 – Jaro Score)

The Prefix Length is the number of characters in the common prefix of the two strings, up to a maximum of 4. The Prefix Scale is a constant scaling factor that is usually set between 0 and 0.25; with a maximum prefix length of 4, a larger value could push the score above 1.
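As a rough illustration, the prefix length can be computed in a few lines of Python. The helper name common_prefix_length and its max_len parameter are hypothetical, not Construct's API, and the case-sensitive comparison is an assumption.

```python
def common_prefix_length(a, b, max_len=4):
    """Count the leading characters shared by a and b, capped at max_len."""
    length = 0
    # Only the first max_len characters can contribute to the prefix length.
    for ch_a, ch_b in zip(a[:max_len], b[:max_len]):
        if ch_a != ch_b:
            break
        length += 1
    return length


print(common_prefix_length("Macadamia", "Macedemia"))  # 3
print(common_prefix_length("Construct", "Constrict"))  # 4 (capped)
```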

In this case, the first 3 characters of Macadamia and Macedemia match (“Mac”), so the prefix length is 3. We use a prefix scale of 0.1:

0.85 + 3*0.1*(1 - 0.85) = 0.895

The prefix length is capped at 4, and in this case only the first 3 characters match. The value 0.1 is the scale factor Winkler most commonly used in his work, but you can choose any value between 0 and 0.25. With the additional points awarded for the matching prefix, the score rises from 0.85, the initial Jaro score, to 0.895, the Jaro-Winkler score.
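The Winkler adjustment can be sketched in Python as follows. The function name jaro_winkler_from_jaro and its argument checks are assumptions for illustration and do not describe Construct's internals; the function starts from an already computed Jaro score.

```python
def jaro_winkler_from_jaro(jaro, prefix_len, prefix_scale=0.1):
    """Jaro-Winkler = Jaro + prefix_len * prefix_scale * (1 - Jaro)."""
    if not 0 <= prefix_scale <= 0.25:
        # Assumption: reject scales outside the range described in the text.
        raise ValueError("prefix_scale should be between 0 and 0.25")
    prefix_len = min(prefix_len, 4)  # the prefix bonus is capped at 4 characters
    return jaro + prefix_len * prefix_scale * (1 - jaro)


# Rounded Jaro score and prefix length from the Macadamia/Macedemia example.
print(round(jaro_winkler_from_jaro(0.85, 3), 3))  # 0.895
```

Because the bonus is multiplied by (1 – Jaro Score), it shrinks as the Jaro score approaches 1, so a perfect Jaro score stays at 1.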
