Navigation Tree

Smart Correlation

Last modified 13 Sep 2022 13:20 +02:00

feature

This page describes midPoint feature. Please see the feature page for more details.

Since 4.6

This functionality is available since version 4.6.

Table of Contents

Introduction
Configuration
Advanced Concepts
Limitations

Introduction

The "smart correlation" is a mechanism to correlate identity data to existing focus objects in the repository. Typical use is e.g. the resource objects correlation during the synchronization process, where (newly discovered) accounts on a resource are synchronized to midPoint. Another use will be the correlation during manual or automated registration of new users, including self-registration.

In midPoint 4.4 and before, the only way of correlation was the use of correlation filters, with strictly binary output: either a matching object was found, or there was no match.

In midPoint 4.5, we introduced manual correlation for situations where there is a candidate match (or more candidate matches) that need to be resolved by the human operator. It worked with the external ID Match API service.

The goal for midPoint 4.6 and beyond is to provide a configurable correlation mechanism that can provide approximate matching. For short, it is called smart correlation.

Configuration

In 4.6, the correlation mechanism is based on correlation rules, technically called correlators. For example, a rule can state that "if the family name, date of birth, and the national-wide ID all match, then the identity is the same". Another rule can state that "if (only) the national-wide ID matches, then the identity is the same with the confidence level of 0.7" (whatever the number means).

In the future, we plan to provide AI-assisted correlation that will suggest correlation candidates also according to human resolution of disputed correlation situations in the history. At that time, the correlation rules will be not the only - or even not the primary - source for correlation suggestions. But, in 4.6, they are the only driver of the correlation algorithm.

Correlation Rule Types

There are the following types of correlation rules:

Table 1. Types of correlation rules
Type	Meaning	Available since
`items`	Smart item-based correlation rule. The suggested one.	4.6
`filter`	Legacy filter-based correlation rule.	1.7
`expression`	Experimental rule, based on an evaluation of a custom expression.	4.5
`idmatch`	Rule that uses an external ID Match service. Please see this document for more information.	4.5

Precisely speaking, there is also a composite rule that provides an aggregation of the results of its children. However, in 4.6 it is supported only as a top-level rule, i.e., it is present automatically - without the possibility (nor need) to specify it explicitly.

Correlation Configuration Placement

The correlation configuration can reside in the following places:

A resource object type definition: either in top-level correlation item, or distributed into individual attribute definitions.
An object template, currently in top-level correlation item. ^[1]

The reason for such flexibility is that in some scenarios, the correlation is bound to given type of focus objects, regardless of the origin of identity data we need to correlate. They can come from any resource or (in the future) they may come from registration or self-registration processes. In other scenarios, though, the correlation rules are specific to given resource object type.

When present, the configuration attached to the resource object type takes precedence over the one connected to the object template.

The configuration attached to the object template requires the use of archetypes. See Limitations.

Configuration Examples

Example 1: Attribute-Bound Definition

The following is the most basic example: an attribute is mapped to a focus property that serves as a correlation item.

Listing 1. icfs:name serving as a correlation attribute

<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <correlator/> (2)
            <inbound> (1)
                <target>
                    <path>name</path>
                </target>
            </inbound>
        </attribute>
        ...
    </objectType>
</schemaHandling>

1	Means that the `icfs:name` attribute is mapped to the `name` focus property.
2	Means that the account is correlated to the focus objects by searching for the corresponding value of `name` property.

Notes:

The correlation takes place before the regular inbound mappings are evaluated. That is why - starting with 4.5 - there is a special inbound mapping evaluation mode: correlation-time evaluation. Even though it is turned off by default, the attribute-level correlator element automatically turns it on for the selected inbound mapping.
If multiple attributes are marked as correlator, they are evaluated separately. If there is a need of evaluation of two attributes together (meaning that they have to match both), it is necessary to use the explicit items correlator.

Example 2: Resource Object Type Bound Definition

Here we show the same logic defined at the level of the resource object type:

Listing 2. icfs:name serving as a correlation attribute (defined at the level of resource object type)

<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <inbound>
                <target>
                    <path>name</path>
                </target>
            </inbound>
        </attribute>
        ...
        <correlation>
            <correlators>
                <items>
                    <item>
                        <ref>name</ref> (1)
                    </item>
                </items>
            </correlators>
        </correlation>
        ...
    </objectType>
</schemaHandling>

1	Declaring the `name` to be the correlation item.

Like we have seen in Example 1, mentioning the item name as a correlation item enables the correlation-time inbound processing for it.

Example 3: Object Template Based Correlation Definition

Finally, this is how the correlation can be defined at the level of an object template. Here we show a rule requiring that both given name and family name match.

Listing 3. Correlation defined at the level of object template: requiring a match of both given and family name

<objectTemplate oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d" xmlns="...">
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

The correlation-time inbound processing is automatically enabled also in this case. The object template must be connected to the resource object type via the archetype declared in the object type definition.^[2] An example:

Listing 4. Connecting an object template to resource object type via an archetype

<resource oid="..." xmlns="...">
    ...
    <schemaHandling>
        <objectType>
            ...
            <focus>
                <type>UserType</type>
                <archetypeRef oid="36d04df1-8f81-4442-b576-97b54c716245" />
            </focus>
            ...
        </objectType>
    </schemaHandling>
</resource>

<archetype oid="36d04df1-8f81-4442-b576-97b54c716245" xmlns="...">
    ...
    <archetypePolicy>
        <objectTemplateRef oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d"/>
    </archetypePolicy>
    ...
</archetype>

Advanced Concepts

Multiple Correlation Rules

In more complex deployments, there may be multiple correlation rules. For example, we may want to correlate by given name, family name, date of birth, and national ID using the following rules:

Table 2. Sample set of correlation rules
Rule#	Situation	Resulting confidence
1	Family name, date of birth, and national ID exactly match.	1.0
2	Given name, family name, and date of birth exactly match.	0.4
3	The national ID exactly matches.	0.4

The confidence values are described on rule composition page.

These rules can be configured like this:

Listing 5. Configuration for the rules 1-3 from Table 2

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                </item>
                <item>
                    <ref>extension/dateOfBirth</ref>
                </item>
                <item>
                    <ref>extension/nationalId</ref>
                </item>
                <composition>
                    <weight>1.0</weight> <!-- this is the default -->
                </composition>
            </items>
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
                <item>
                    <ref>extension/dateOfBirth</ref>
                </item>
                <composition>
                    <weight>0.4</weight>
                </composition>
            </items>
            <items>
                <item>
                    <ref>extension/nationalId</ref>
                </item>
                <composition>
                    <weight>0.4</weight>
                </composition>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

There are a lot of configuration options here. For example, we can specify the order of rules evaluation and their "A implies B" relations that ensure the correct computation of confidence in case of rule A implying rule B. Please see rule composition page for more information.

Custom Indexing

This feature is available only when using the native repository implementation.

Sometimes, we need to base the search on specially-indexed data. For example, we could need to match only first five normalized characters of the surname. Or, we could want to take only digits into account when searching for the national ID.

These requirements can be configured like this:

Listing 6. Examples of custom indexing

<objectTemplate>
    ...
    <item>
        <ref>familyName</ref>
        <indexing>
            <normalization>
                <steps>
                    <polyString> (1)
                        <order>1</order>
                    </polyString>
                    <prefix> (2)
                        <order>2</order>
                        <length>5</length>
                    </prefix>
                </steps>
            </normalization>
        </indexing>
    </item>
    <item>
        <ref>extension/nationalId</ref>
        <indexing>
            <normalization>
                <name>digits</name> (3)
                <steps>
                    <custom>
                        <expression>
                            <script>
                                <code>
                                    basic.stringify(input).replaceAll("[^\\d]", "") (4)
                                </code>
                            </script>
                        </expression>
                    </custom>
                </steps>
            </normalization>
        </indexing>
    </item>
    ...
</objectTemplate>

1	Applies the default PolyString normalizer to the original value.
2	Takes the first 5 characters of the normalized value.
3	Name by which this normalization can be referenced.
4	Removes everything except for digits.

These indexes are then used automatically when correlating according to familyName and extension/nationalId, respectively.

If there are multiple normalizations defined for a given focus item (and none is defined as the default one), we can select the one to be used by mentioning it within the correlation item definition:

Listing 7. Selecting the proper normalization for correlation

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>extension/nationalId</ref>
                    <search> (1)
                        <index>digits</index>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

1	Points to the `digits` normalization for `extension/nationalId` property.

Please see custom indexing and items correlator for more information.

Fuzzy Searching

By default, the searching is done using "exact match" criteria, either on original values or on the ones that underwent the standard or custom normalization. Sometimes, however, we want to search for objects that have a property value somewhat similar to the value we have at hand. For example, we get an account for Jack Sparrow, but besides matching users with surname Sparrow we may want to consider also users Sparow, Sparrou, and so on; although potentially with a lower confidence value.

To do this, a fuzzy search logic was implemented. There are two methods available:

Table 3. Fuzzy string matching methods
Method	Description
Levenshtein edit distance	Matches according to the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. (From wikipedia.)
Trigram similarity	Matches using the ratio of common trigrams to all trigrams in compared strings. (See PostgreSQL documentation on `pg_trgm` module.)

The fuzzy search is available only when using the native repository implementation.

An example that searches for users having given name and family name close to the provided ones. The given name has to have Levenshtein edit distance (to the provided one) at most 3. The family name has to have trigram similarity (to the provided one) at least 0.8.

Listing 8. Correlation using fuzzy string matching

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <similarity>
                                <threshold>0.8</threshold>
                            </similarity>
                        </fuzzy>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

Please see fuzzy searching page for more information.

Multiple Identity Data Sources

The advanced correlation needs often go hand in hand with the situations when there are multiple sources of the identity data. For example, a university may have its Student Information System (SIS) providing data on students and faculty, Human Resources (HR) System keeping records of all staff - faculty and others, and "External persons" (EXT) system for maintaining data about visitors and other persons related to the university in a way other than being a student or employee.

While the data about a person are usually consistent, there may be situations when they differ. For example, the given name may be recorded differently in SIS and HR systems. Or the title may be forgotten to be updated in HR. An old record in the "external persons" system may be out-of-date altogether.

This situation leads to two kinds of requirements:

When processing data from these systems, midPoint has to somehow decide which ones are "authoritative", that is, which ones to propagate to the "official" user data stored in the repository.
When correlating, we may want to match data from all systems for the candidate owners. (Not only the "official" user data.)

MidPoint supports both of them. For the first one, the engineer must provide an algorithm for determination of the authoritative data source. The second one is provided transparently, by indexing the data from all the identity sources.

The following example shows how to configure givenName, familyName, dateOfBirth, and nationalId as "multi-source" properties. They are kept separately for each source: SIS, HR, and "external persons" system. The order of "authoritativeness" (so to say) is: SIS, HR, external, as can be seen in the defaultAuthoritativeSource mapping.

Listing 9. Setting up four multi-source properties

<objectTemplate>
    ...
    <item>
        <ref>givenName</ref>
        <multiSource/> (1)
    </item>
    <item>
        <ref>familyName</ref>
        <multiSource/>
    </item>
    <item>
        <ref>extension/dateOfBirth</ref>
        <multiSource/>
    </item>
    <item>
        <ref>extension/nationalId</ref>
        <multiSource/>
    </item>
    ...
    <multiSource>
        <defaultAuthoritativeSource> (2)
            <expression>
                <script>
                    <code>
                        import com.evolveum.midpoint.util.MiscUtil

                        def RESOURCE_SIS_OID = '...'
                        def RESOURCE_HR_OID = '...'
                        def RESOURCE_EXT_OID = '...'

                        // The order of authoritativeness is: SIS, HR, external

                        if (identity == null) {
                            return null
                        }

                        def sources = identity
                                .collect { it.source }
                                .findAll { it != null }

                        def sis = sources.find { it.resourceRef?.oid == RESOURCE_SIS_OID }
                        def hr = sources.find { it.resourceRef?.oid == RESOURCE_HR_OID }
                        def external = sources.find { it.resourceRef?.oid == RESOURCE_EXT_OID }

                        MiscUtil.getFirstNonNull(sis, hr, external)
                    </code>
                </script>
            </expression>
        </defaultAuthoritativeSource>
    </multiSource>
</objectTemplate>

1	Marks a property to be "multi-source" one.
2	A mapping that selects the most authoritative data source for a given user.

Please see the page on multiple identity data sources for more information.

Limitations

As a general rule, when referencing a configuration related to smart correlation (including custom indexing or multi-source processing) in an object template, it must be bound to the resource object type in question via statically-defined archetype (see Listing 3 and 4 in Example 3: Object Template Based Correlation Definition).

Other limitations are mentioned on pages for individual sub-features:

1. The item-bound usage is planned for the future. It can be configured now, but will not have any effect.

2. The main reason is that midPoint has to know the archetype before the correlation-time mappings are evaluated. That’s why it is not sufficient if it’s determined e.g. during inbound processing.

Was this page helpful?

YES NO

Thanks for your feedback