Correlation

Last modified 17 Feb 2026 09:41 +01:00

Correlation feature

This page is an introduction to the Correlation midPoint feature. Please see the feature page for more details.

Table of Contents

Introduction
Configuration
Advanced concepts
Limitations

Introduction

Correlation (also known as smart correlation) is a mechanism used to correlate identity data to existing focus objects in the repository. It is typically used during the synchronization process to match newly discovered accounts on a resource with midPoint focus objects, or during a manual or automated registration of new users (including self-registration).

The goal is to provide a configurable mechanism that supports approximate matching. A match can then be resolved automatically if it meets a defined confidence threshold, or manually by a human operator.

To see how to configure correlation in the GUI, refer to Resource wizard: Object type correlation.

Configuration

The correlation mechanism is based on correlation rules, technically called correlators. For example, a rule can state that "if the family name, date of birth, and the national ID all match, then the identity is the same". Another rule can state that "if (only) the national ID matches, then the identity is the same with the confidence level of 0.7" (i.e., 70% confidence).
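
As a rough illustration, the second rule above could be written as an `items` correlator with a reduced weight. This is only a sketch that borrows the `composition`/`weight` syntax from Listing 5 later on this page. The `extension/nationalId` path is an assumed extension property, and how a weight translates into the final confidence is governed by rule composition.

```xml
<correlation>
    <correlators>
        <items>
            <!-- Assumed extension property holding the national ID -->
            <item>
                <ref>extension/nationalId</ref>
            </item>
            <composition>
                <!-- Illustrative: a match on this rule alone is meant to yield confidence 0.7 -->
                <weight>0.7</weight>
            </composition>
        </items>
    </correlators>
</correlation>
```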

In the future, we plan to provide AI-assisted correlation that will also suggest correlation candidates based on how human operators resolved previously disputed correlation situations. At that point, correlation rules will no longer be the only, or even the primary, source of correlation suggestions. Currently, however, they are the only driver of the correlation algorithm.

Correlation rule types

There are the following types of correlation rules:

Table 1. Types of correlation rules
Type	Meaning
`items`	Item-based correlation rule (recommended).
`filter`	Legacy filter-based correlation rule.
`expression`	Experimental rule, based on an evaluation of a custom expression.
`idmatch`	Rule that uses an external ID Match service. See Identity Matching (Correlation) Implementation for more information.

Strictly speaking, there is also a composite rule that aggregates the results of its children. However, it is currently supported only as a top-level rule: it is present automatically and cannot (and need not) be specified explicitly.

Correlation configuration placement

Correlation configuration can reside in the following places:

A resource object type definition: either in a top-level correlation item, or distributed into individual attribute definitions.
An object template, currently in a top-level correlation item. ^[1]

The reason for such flexibility is that in some scenarios, correlation is bound to a certain type of focus objects, regardless of the origin of identity data we need to correlate. They can come from any resource or (in the future) they may come from registration or self-registration processes. In other scenarios, though, correlation rules are specific to a resource object type.

When present, the configuration attached to the resource object type takes precedence over the one connected to the object template.

The configuration attached to an object template requires the use of archetypes. See Limitations.

Configuration examples

Example 1: Attribute-bound definition

The following is the most basic example: an attribute is mapped to a focus property that serves as a correlation item.

Listing 1. icfs:name serving as a correlation attribute

<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <correlator/> (2)
            <inbound> (1)
                <target>
                    <path>name</path>
                </target>
            </inbound>
        </attribute>
        ...
    </objectType>
</schemaHandling>

1	Means that the `icfs:name` attribute is mapped to the `name` focus property.
2	Means that the account is correlated to the focus objects by searching for the corresponding value of the `name` property.

If multiple attributes are marked as correlators, a match on any one of them is enough for an overall match. Technically, the correlators are evaluated separately; see rule composition for details. If you need to evaluate two attributes together (i.e., both have to match), use the explicit items correlator.
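
For instance, a rule requiring two attributes to match together might look like the sketch below. It only combines the `items` syntax used in Listings 2 and 3 later on this page. The attribute names `ri:givenName` and `ri:familyName` are hypothetical and depend on the actual resource schema.

```xml
<schemaHandling>
    <objectType>
        ...
        <!-- Hypothetical resource attributes; actual names depend on the resource schema -->
        <attribute>
            <ref>ri:givenName</ref>
            <inbound>
                <target>
                    <path>givenName</path>
                </target>
            </inbound>
        </attribute>
        <attribute>
            <ref>ri:familyName</ref>
            <inbound>
                <target>
                    <path>familyName</path>
                </target>
            </inbound>
        </attribute>
        ...
        <correlation>
            <correlators>
                <!-- One items correlator: both items have to match -->
                <items>
                    <item>
                        <ref>givenName</ref>
                    </item>
                    <item>
                        <ref>familyName</ref>
                    </item>
                </items>
            </correlators>
        </correlation>
        ...
    </objectType>
</schemaHandling>
```

Note that the attributes carry no `correlator` element here; the matching logic lives entirely in the explicit `items` correlator.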

Correlation takes place before the regular inbound mappings are evaluated. That is why there is a special inbound mapping evaluation mode: correlation-time evaluation. Even though it is turned off by default, the attribute-level correlator element automatically turns it on for the selected inbound mapping.

Example 2: Resource object type bound definition

Here we show the same logic defined at the level of the resource object type:

Listing 2. icfs:name serving as a correlation attribute (defined at the level of resource object type)

<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <inbound>
                <target>
                    <path>name</path>
                </target>
            </inbound>
        </attribute>
        ...
        <correlation>
            <correlators>
                <items>
                    <item>
                        <ref>name</ref> (1)
                    </item>
                </items>
            </correlators>
        </correlation>
        ...
    </objectType>
</schemaHandling>

1	Declaring the `name` to be the correlation item.

As we have seen in Example 1, mentioning name as a correlation item enables the correlation-time inbound processing.

Example 3: Object template based correlation definition

Finally, this is how the correlation can be defined at the level of an object template. Here we show a rule requiring that both given name and family name match.

Listing 3. Correlation defined at the object template level: both given and family name have to match

<objectTemplate oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d" xmlns="...">
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

The correlation-time inbound processing is automatically enabled in this case as well. The object template must be connected to the resource object type via the archetype declared in the object type definition.^[2] An example:

Listing 4. Connecting an object template to a resource object type via an archetype

<resource oid="..." xmlns="...">
    ...
    <schemaHandling>
        <objectType>
            ...
            <focus>
                <type>UserType</type>
                <archetypeRef oid="36d04df1-8f81-4442-b576-97b54c716245" />
            </focus>
            ...
        </objectType>
    </schemaHandling>
</resource>

<archetype oid="36d04df1-8f81-4442-b576-97b54c716245" xmlns="...">
    ...
    <archetypePolicy>
        <objectTemplateRef oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d"/>
    </archetypePolicy>
    ...
</archetype>

Example 4: Correlation for outbound resources

Correlation relies on an inbound mapping that converts a resource attribute to a property of a midPoint object. This approach works well for inbound resources because it simplifies the configuration. Nevertheless, there are use cases where a strictly outbound resource has existing accounts that need to be correlated. In such cases, a regular inbound mapping is not desired.

For this situation, midPoint allows you to configure a mapping that is used only for correlation and not for "standard" processing (by the clockwork).

Listing 4. Using inbound mapping only for correlation

<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <correlator/>
            <inbound>
                <target>
                    <path>name</path>
                </target>
                <use>correlation</use> (1)
            </inbound>
            <outbound> (2)
                ...
            </outbound>
        </attribute>
        ...
    </objectType>
</schemaHandling>

1	Means that the inbound mapping will be used only for correlation and will not be processed otherwise.
2	Represents the outbound mapping as usual.

Advanced concepts

Multiple correlation rules

In more complex deployments, there may be multiple correlation rules. For example, we may want to correlate by given name, family name, date of birth, and national ID using the following rules:

Table 2. Sample set of correlation rules
Rule#	Situation	Resulting confidence
1	Family name, date of birth, and national ID exactly match.	1.0
2	Given name, family name, and date of birth exactly match.	0.4
3	The national ID exactly matches.	0.4

For details on confidence values, see Rule Composition.

These rules can be configured like this:

Listing 5. Configuration for the rules 1-3 from Table 2

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                </item>
                <item>
                    <ref>extension/dateOfBirth</ref>
                </item>
                <item>
                    <ref>extension/nationalId</ref>
                </item>
                <composition>
                    <weight>1.0</weight> <!-- this is the default -->
                </composition>
            </items>
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
                <item>
                    <ref>extension/dateOfBirth</ref>
                </item>
                <composition>
                    <weight>0.4</weight>
                </composition>
            </items>
            <items>
                <item>
                    <ref>extension/nationalId</ref>
                </item>
                <composition>
                    <weight>0.4</weight>
                </composition>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

Many more configuration options are available here. For example, we can specify the order in which the rules are evaluated, and declare "A implies B" relations between them so that the confidence is computed correctly when rule A implies rule B. For details, see Rule Composition.

Custom indexing

This feature is available only when using the native repository implementation.

Sometimes, we need to base the search on data indexed in a specific way. For example, we may need to match only the first five normalized characters of surnames. Or, when searching for a national ID, we may want to take only digits into account.

These requirements can be configured like this:

Listing 6. Examples of custom indexing

<objectTemplate>
    ...
    <item>
        <ref>familyName</ref>
        <indexing>
            <normalization>
                <steps>
                    <polyString> (1)
                        <order>1</order>
                    </polyString>
                    <prefix> (2)
                        <order>2</order>
                        <length>5</length>
                    </prefix>
                </steps>
            </normalization>
        </indexing>
    </item>
    <item>
        <ref>extension/nationalId</ref>
        <indexing>
            <normalization>
                <name>digits</name> (3)
                <steps>
                    <custom>
                        <expression>
                            <script>
                                <code>
                                    basic.stringify(input).replaceAll("[^\\d]", "") (4)
                                </code>
                            </script>
                        </expression>
                    </custom>
                </steps>
            </normalization>
        </indexing>
    </item>
    ...
</objectTemplate>

1	Applies the default PolyString normalizer to the original value.
2	Takes the first 5 characters of the normalized value.
3	Name by which this normalization can be referenced.
4	Removes everything except for digits.

These indexes are then used automatically when correlating according to familyName and extension/nationalId, respectively.
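
As a small illustration of this automatic use, a correlator like the following sketch needs no extra search settings. It assumes that the indexing from Listing 6 is defined in the same object template and that each item has only that one normalization.

```xml
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <!-- Expected to use the index built from the first five normalized characters -->
                <item>
                    <ref>familyName</ref>
                </item>
                <!-- Expected to use the "digits" index -->
                <item>
                    <ref>extension/nationalId</ref>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
```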

If there are multiple normalizations defined for a given focus item (and none is defined as the default one), we can select the one to be used by mentioning it within the correlation item definition:

Listing 7. Selecting the proper normalization for correlation

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>extension/nationalId</ref>
                    <search> (1)
                        <index>digits</index>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>

1	Points to the `digits` normalization for the `extension/nationalId` property.

See Custom Indexing and The Items Correlator for more information.

Fuzzy searching

By default, searching uses "exact match" criteria, applied either to the original values or to values that have gone through the standard or custom normalization. Sometimes, however, we want to find objects whose property value is merely similar to the value at hand. For example, when we get an account for Jack Sparrow, we may want to consider not only users with the surname Sparrow, but also users named Sparow, Sparrou, and so on, although potentially with a lower confidence value.

For this purpose, midPoint provides fuzzy search. Two methods are available:

Table 3. Fuzzy string matching methods
Method	Description
Levenshtein edit distance	Matches according to the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. (From Wikipedia.)
Trigram similarity	Matches using the ratio of common trigrams to all trigrams in compared strings. (See PostgreSQL documentation on `pg_trgm` module.)

The fuzzy search is available only when using the native repository implementation.

The example below searches for users whose given name and family name are close to the provided ones. The given name must be within a Levenshtein edit distance of at most 3 from the provided one. The family name must have a trigram similarity of at least 0.8 to the provided one.

Listing 8. Correlation using fuzzy string matching

<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <similarity>
                                <threshold>0.8</threshold>
                            </similarity>
                        </fuzzy>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
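
The "lower confidence" idea mentioned above could be sketched by pairing an exact-match rule with a fuzzy one that carries a smaller weight. This is only a sketch assembled from elements shown in Listing 5 and Listing 8. The weight of 0.5 and the threshold of 2 are arbitrary illustrative values, and the way the two rules combine when both match is described in Rule Composition.

```xml
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <!-- Rule A: given name and family name match exactly (default weight) -->
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
            </items>
            <!-- Rule B: exact given name, family name within edit distance 2 (illustrative lower weight) -->
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>2</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
                <composition>
                    <weight>0.5</weight>
                </composition>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
```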

See Fuzzy Searching for more information.

Multiple identity data sources

Advanced correlation often goes hand in hand with situations in which there are multiple sources of identity data. For example, a university may have a Student Information System (SIS) providing data on students and faculty, a Human Resources (HR) system keeping records of all staff (faculty and others), and an "External persons" (EXT) system maintaining data about visitors and other persons related to the university in a way other than being a student or an employee.

While the data about a person are usually consistent, there may be situations when they differ. For example, the given name may be recorded differently in the SIS and HR systems. Or a title may not be updated in HR. An old record in the "external persons" system may be out-of-date altogether.

This situation leads to the following kinds of requirements:

When processing data from these systems, midPoint has to be able to decide which ones are "authoritative", that is, which ones to propagate to the "official" user data stored in the repository.
When correlating, we may want to match data from all systems for the candidate owners. (Not only the "official" user data.)

MidPoint supports these requirements. For the first one, the engineer must provide an algorithm for determining the authoritative data source. The second one is provided transparently, by indexing the data from all identity sources.

The following example shows how to configure the givenName, familyName, dateOfBirth, and nationalId as "multi-source" properties. They are kept separately for each source: SIS, HR, and "external persons" system. The order of "authoritativeness" is: SIS, HR, external, as can be seen in the defaultAuthoritativeSource mapping.

Listing 9. Setting up four multi-source properties

<objectTemplate>
    ...
    <item>
        <ref>givenName</ref>
        <multiSource/> (1)
    </item>
    <item>
        <ref>familyName</ref>
        <multiSource/>
    </item>
    <item>
        <ref>extension/dateOfBirth</ref>
        <multiSource/>
    </item>
    <item>
        <ref>extension/nationalId</ref>
        <multiSource/>
    </item>
    ...
    <multiSource>
        <defaultAuthoritativeSource> (2)
            <expression>
                <script>
                    <code>
                        import com.evolveum.midpoint.util.MiscUtil

                        def RESOURCE_SIS_OID = '...'
                        def RESOURCE_HR_OID = '...'
                        def RESOURCE_EXT_OID = '...'

                        // The order of authoritativeness is: SIS, HR, external

                        if (identity == null) {
                            return null
                        }

                        def sources = identity
                                .collect { it.source }
                                .findAll { it != null }

                        def sis = sources.find { it.resourceRef?.oid == RESOURCE_SIS_OID }
                        def hr = sources.find { it.resourceRef?.oid == RESOURCE_HR_OID }
                        def external = sources.find { it.resourceRef?.oid == RESOURCE_EXT_OID }

                        MiscUtil.getFirstNonNull(sis, hr, external)
                    </code>
                </script>
            </expression>
        </defaultAuthoritativeSource>
    </multiSource>
</objectTemplate>

1	Marks a property as "multi-source".
2	A mapping that selects the most authoritative data source for a given user.

See Multiple Identity Data Sources for more information.

Limitations

As a general rule, when a configuration related to correlation (including custom indexing or multi-source processing) resides in an object template, it must be bound to the resource object type in question via a statically defined archetype (see Listings 3 and 4 in Example 3: Object template based correlation definition).

Other limitations are mentioned on the pages for individual sub-features.

^[1] The item-bound usage is planned for the future. It can be configured now, but it will not have any effect.

^[2] The main reason is that midPoint has to know the archetype before the correlation-time mappings are evaluated. It is therefore not sufficient for the archetype to be determined later, e.g., during inbound processing.

Compliance

This feature is related to the following compliance frameworks:
