Description

The Fuzzy Match step finds strings that potentially match by using duplicate-detection algorithms that calculate the similarity between values from two streams of data. The step returns the matching values as a separated list, limited to the matches that fall within the user-defined minimal and maximal values.

General Tab

The General tab enables you to define the lookup step, the fields to compare, and the algorithm used to match similar strings of data.

Option

Definition

Step name

Name of this step as it appears in the transformation workspace

Lookup step

Identifies the step that contains the fields to match

Lookup field

Identifies the field to match

Main stream field

Identifies the field in the main stream to compare against the Lookup field

Algorithm

Identifies which string-matching algorithm to use. Options include Levenshtein, Damerau-Levenshtein, Needleman Wunsch, Jaro, Jaro Winkler, Pair letters similarity, Metaphone, Double Metaphone, SoundEx, and Refined SoundEx.

Case sensitive

Specifies whether the comparison treats uppercase and lowercase letters as different. Applies only to the Levenshtein algorithms.

Get closer value

When checked, returns only the single result with the highest similarity score. When unchecked, returns all matches that satisfy the minimal and maximal value settings as a list delimited by the values separator.

Minimal value

Identifies the lowest similarity score that a match can have

Maximal value

Identifies the highest similarity score that a match can have

Values separator

Identifies the string that separates the matches. Only available for specific algorithms and when the Get closer value option is unchecked.
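The interaction between the Get closer value, minimal value, maximal value, and Values separator options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the step's implementation: the function and parameter names are hypothetical, the similarity function is a stand-in based on Python's difflib rather than one of the algorithms listed above, and returning no score in list mode is a simplification.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not one of the step's own algorithms."""
    return SequenceMatcher(None, a, b).ratio()


def fuzzy_match(value, lookup_values, minimal=0.0, maximal=1.0,
                get_closer_value=True, separator=","):
    """Return (match, score) for one main stream value, or (None, None)."""
    # Score every lookup value against the main stream value.
    scored = [(candidate, similarity(value, candidate))
              for candidate in lookup_values]
    # Keep only candidates whose score lies within [minimal, maximal].
    in_range = [(c, s) for c, s in scored if minimal <= s <= maximal]
    if not in_range:
        return None, None
    if get_closer_value:
        # Checked: a single result, the one with the highest score.
        return max(in_range, key=lambda pair: pair[1])
    # Unchecked: all matches in range, joined by the separator
    # (assumption: no single score is reported in this mode).
    return separator.join(c for c, _ in in_range), None


lookup = ["Pentaho", "Pentah", "Kettle"]
print(fuzzy_match("Pentaho", lookup, minimal=0.8))
print(fuzzy_match("Pentaho", lookup, minimal=0.8,
                  get_closer_value=False, separator=";"))
```

With the option checked, the call yields one candidate and its score. With it unchecked, every candidate inside the range is concatenated in lookup order.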

Algorithm Definitions

Within the Algorithm field, there are several options available to compare and match strings.
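To make the difference between the two broad families of options concrete, the following sketch shows a textbook Levenshtein edit distance (with a hypothetical case_sensitive flag mirroring the Case sensitive option) and a textbook American Soundex encoder. Both are generic versions written for illustration; the step's own implementations may differ in details such as score scaling or handling of special characters.

```python
def levenshtein(a: str, b: str, case_sensitive: bool = True) -> int:
    """Number of single-character edits needed to turn a into b."""
    if not case_sensitive:
        a, b = a.lower(), b.lower()
    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def soundex(word: str) -> str:
    """Four-character American Soundex code, or '' for an empty word."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate equal codes
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes can reappear
    return (result + "000")[:4]


print(levenshtein("kitten", "sitting"))
print(levenshtein("Pentaho", "pentaho", case_sensitive=False))
print(soundex("Robert"), soundex("Rupert"))
```

An edit distance produces a number that shrinks as the strings get closer, whereas a phonetic encoder produces a code, and two strings match when their codes are equal.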

Fields Tab

The Fields tab enables you to define how to return the results of a comparison.

Option

Definition

Match field

Defines the name of the column that contains the comparison value

Value field

Defines the name of the column that contains the similarity score of the match

You can also specify the list of additional fields to retrieve from the lookup stream.
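As a rough illustration of how these settings shape an output row, the sketch below assembles one row from a main stream row, the matched lookup row, and a score. It assumes that the Match field holds the matched lookup value and the Value field holds its score; the function name, the default column names, and the sample data are all hypothetical.

```python
def build_output_row(main_row, lookup_row, lookup_field, score,
                     match_field="match", value_field="measure value",
                     additional_fields=()):
    """Combine a main stream row with the result of one comparison."""
    row = dict(main_row)                        # keep the incoming fields
    row[match_field] = lookup_row[lookup_field]  # matched lookup value
    row[value_field] = score                     # similarity score
    for name in additional_fields:               # extra lookup fields
        row[name] = lookup_row[name]
    return row


main_row = {"customer": "Jon Smith"}
lookup_row = {"name": "John Smith", "id": 42, "city": "Oslo"}
print(build_output_row(main_row, lookup_row, "name", 0.95,
                       additional_fields=["id"]))
```

Only the lookup fields listed explicitly are copied, so the city column in this example stays out of the result.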