Fuzzy Searching

Last modified 13 Sep 2022 22:36 +02:00

Since 4.6

This functionality is available since version 4.6.

Table of Contents

Introduction
Fuzzy Search Methods
Using in Correlation
- Specification of the Filters
- Confidence Values
Using in Filters
Limitations

Introduction

This feature is available only when using the native repository implementation.

For an introduction, please see Fuzzy Searching section in the overview document.

Fuzzy Search Methods

Currently, there are two methods available:

Table 1. Fuzzy string matching methods
Method	Description
Levenshtein edit distance	Matches according to the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. (From wikipedia.)
Trigram similarity	Matches using the ratio of common trigrams to all trigrams in compared strings. (See PostgreSQL documentation on `pg_trgm` module.)

Using in Correlation

Specification of the Filters

The fuzzy searching filters are specified in search/fuzzy configuration item. Let us have look at an example that searches for users having family name within the Levenshtein distance to the provided one of at most 3.

Listing 1. Correlation looking for family name "close enough" to the provided one

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

There are the following options available:

Table 2. Configuration properties for Levenshtein edit distance search
Property	Description	Default
`threshold`	Upper limit on the edit distance to be matched.	Must be specified.
`inclusive`	Is the value of "threshold" meant to be inclusive?	`true`

Table 3. Configuration properties for trigram similarity search
Property	Description	Default
`threshold`	Lower limit on the similarity to be matched.	Must be specified.
`inclusive`	Is the value of "threshold" meant to be inclusive?	`true`

Confidence Values

When using fuzzy search, not all search results are equally relevant. Typically, the higher Levenshtein edit distance, the lower confidence we have in the particular match. On the other hand, the higher trigram similarity value, the higher confidence.

Therefore, midPoint allows to specify a transformation from the fuzzy string metric (edit distance or similarity value) to the confidence value of (0, 1].

There are reasonable expressions that are used by default.

For Levenshtein edit distance, it is 1/(d+1), where d is the value of the distance. It works like this:

Table 4. Transformation of edit distance to a confidence value by the default expression
Edit distance	Resulting confidence
0 (exact match)	1.0
1	0.5
2	0.333
3	0.25
4	0.2
…	…

For trigram similarity, it is simply the value of the similarity itself.

The default formulas may or may not fit your needs. You can provide any expression to do the computation. For example, the following code will map distances of 0, 1, 2, and 3 to the confidence values of 1.0, 0.9, 0.8, and 0.7, respectively.

Listing 2. Using a custom formula for confidence value with given Levenshtein edit distance

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                        <confidence>
                            <expression>
                                <script>
                                    <code>[1.0, 0.9, 0.8, 0.7][input]</code>
                                </script>
                            </expression>
                        </confidence>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

Multiple Correlation Items

If there are multiple correlation items in given correlation rule, their confidences are multiplied. For example:

Listing 3. Two items to be matched by a fuzzy search

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                    <search>
                        <fuzzy>
                            <similarity>
                                <threshold>0.5</threshold>
                            </similarity>
                        </fuzzy>
                    </search>
                </item>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

(Here we use the default formulas for confidence values.)

Now, let us assume that a correlation candidate has a given name with the similarity of 0.8, and the family name with an edit distance of 1. Its overall confidence is then computed as:

Table 5. Example of the confidence computation
Property	Fuzzy search metric value	Confidence factor
`givenName`	0.8	0.8
`familyName`	1	0.5
Overall confidence		0.4 (= 0.8 x 0.5)

The overall confidence value may be later scaled using a custom rule weight when rules are composed together, as described in rule composition document.

Using in Filters

The use of fuzzy matching outside correlation is highly experimental. In particular, matching of PolyString values does not work as expected. Also, the serialization format may change in the future.

Here we describe it only for educational purposes - to emphasize the fact that correlation is ultimately implemented using regular queries.

A query based on the Levenshtein edit distance:

Listing 4. Sample Levenshtein distance query in XML

<q:query xmlns:q="http://prism.evolveum.com/xml/ns/public/query-3">
    <q:filter>
        <q:fuzzyStringMatch>
            <q:path>familyName</q:path>
            <q:value>gren</q:value>
            <q:method>
                <q:levenshtein>
                    <q:threshold>3</q:threshold>
                </q:levenshtein>
            </q:method>
        </q:fuzzyStringMatch>
    </q:filter>
</q:query>

A similarity-based filter:

Listing 5. Sample trigram similarity filter in Axiom

familyName similarity ('gren', 0.5, true)

Limitations

This feature is available only when using the native repository implementation.

Was this page helpful?

YES NO

Thanks for your feedback