Smart Correlation
Correlation feature
This page is an introduction to the Correlation midPoint feature.
Please see the feature page for more details.
Introduction
"Smart correlation" is a mechanism for correlating identity data to existing focus objects in the repository. A typical use is the correlation of resource objects during synchronization, where (newly discovered) accounts on a resource are synchronized to midPoint. Another use will be correlation during manual or automated registration of new users, including self-registration.
The goal of smart correlation is to provide a configurable correlation mechanism that supports approximate matching. A match can then be resolved automatically, when its confidence level passes a defined threshold, or manually by a human operator.
Configuration
The smart correlation mechanism is based on correlation rules, technically called correlators. For example, a rule can state that "if the family name, date of birth, and the national ID all match, then the identity is the same". Another rule can state that "if (only) the national ID matches, then the identity is the same with a confidence level of 0.7" (whatever the number means).
In the future, we plan to provide AI-assisted correlation that will also suggest correlation candidates based on how human operators resolved disputed correlation situations in the past. At that time, the correlation rules will not be the only, or even the primary, source of correlation suggestions. Currently, however, they are the only driver of the correlation algorithm.
Correlation Rule Types
There are the following types of correlation rules:
Type | Meaning |
---|---|
items | Smart item-based correlation rule. The recommended one. |
filter | Legacy filter-based correlation rule. |
expression | Experimental rule, based on the evaluation of a custom expression. |
idMatch | Rule that uses an external ID Match service. Please see this document for more information. |
Strictly speaking, there is also a composite rule that aggregates the results of its children.
However, it is currently supported only as a top-level rule, i.e., it is present automatically, with neither the possibility nor the need to specify it explicitly.
Correlation Configuration Placement
The correlation configuration can reside in the following places:
- A resource object type definition: either in the top-level correlation item, or distributed into individual attribute definitions.
- An object template, currently in the top-level correlation item. [1]
The reason for this flexibility is that in some scenarios, the correlation is bound to a given type of focus objects, regardless of the origin of the identity data we need to correlate. The data can come from any resource or (in the future) from registration or self-registration processes. In other scenarios, though, the correlation rules are specific to a given resource object type.
When present, the configuration attached to the resource object type takes precedence over the one attached to the object template.
The configuration attached to the object template requires the use of archetypes. See Limitations.
Configuration Examples
Example 1: Attribute-Bound Definition
The following is the most basic example: an attribute is mapped to a focus property that serves as a correlation item.
icfs:name serving as a correlation attribute
<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <correlator/> (2)
            <inbound> (1)
                <target>
                    <path>name</path>
                </target>
            </inbound>
        </attribute>
        ...
    </objectType>
</schemaHandling>
1 | Means that the icfs:name attribute is mapped to the name focus property. |
2 | Means that the account is correlated to the focus objects by searching for the corresponding value of the name property. |
If multiple attributes are marked as correlator, a match on any one of them is enough for an overall match.
Technically, they are evaluated separately; see rule composition for details.
If two attributes need to be evaluated together (meaning that both have to match), it is necessary to use the explicit items correlator.
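The explicit items correlator mentioned above could be sketched as follows. This is only a minimal sketch: the attribute names ri:givenName and ri:familyName are hypothetical, and the correlation structure is the one used in Examples 2 and 3 on this page. Because both items belong to a single rule, they have to match together.

```xml
<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>ri:givenName</ref> <!-- hypothetical attribute name -->
            <inbound>
                <target>
                    <path>givenName</path>
                </target>
            </inbound>
        </attribute>
        <attribute>
            <ref>ri:familyName</ref> <!-- hypothetical attribute name -->
            <inbound>
                <target>
                    <path>familyName</path>
                </target>
            </inbound>
        </attribute>
        ...
        <correlation>
            <correlators>
                <items>
                    <!-- both items are part of one rule, so both have to match -->
                    <item>
                        <ref>givenName</ref>
                    </item>
                    <item>
                        <ref>familyName</ref>
                    </item>
                </items>
            </correlators>
        </correlation>
        ...
    </objectType>
</schemaHandling>
```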
The correlation takes place before the regular inbound mappings are evaluated.
That is why there is a special inbound mapping evaluation mode: correlation-time evaluation.
Even though it is turned off by default, the attribute-level correlator element automatically turns it on for the selected inbound mapping.
Example 2: Resource Object Type Bound Definition
Here we show the same logic defined at the level of the resource object type:
icfs:name serving as a correlation attribute (defined at the level of resource object type)
<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <inbound>
                <target>
                    <path>name</path>
                </target>
            </inbound>
        </attribute>
        ...
        <correlation>
            <correlators>
                <items>
                    <item>
                        <ref>name</ref> (1)
                    </item>
                </items>
            </correlators>
        </correlation>
        ...
    </objectType>
</schemaHandling>
1 | Declares name to be the correlation item. |
As in Example 1, mentioning the name item as a correlation item enables the correlation-time inbound processing for it.
Example 3: Object Template Based Correlation Definition
Finally, this is how the correlation can be defined at the level of an object template. Here we show a rule requiring that both given name and family name match.
<objectTemplate oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d" xmlns="...">
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
The correlation-time inbound processing is automatically enabled also in this case. The object template must be connected to the resource object type via the archetype declared in the object type definition.[2] An example:
<resource oid="..." xmlns="...">
    ...
    <schemaHandling>
        <objectType>
            ...
            <focus>
                <type>UserType</type>
                <archetypeRef oid="36d04df1-8f81-4442-b576-97b54c716245" />
            </focus>
            ...
        </objectType>
    </schemaHandling>
</resource>
<archetype oid="36d04df1-8f81-4442-b576-97b54c716245" xmlns="...">
    ...
    <archetypePolicy>
        <objectTemplateRef oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d"/>
    </archetypePolicy>
    ...
</archetype>
Example 4: Correlation for Outbound Resources
Smart correlation relies on an inbound mapping that converts a resource attribute to a property of a midPoint object. This approach works well for inbound resources because it simplifies the configuration. Nevertheless, there are use cases with strictly outbound resources whose existing accounts need to be correlated. In such cases, a regular inbound mapping is not desired.
For this situation, midPoint supports configuring a mapping that is evaluated only for correlation and not for "standard" processing (by the clockwork).
<schemaHandling>
    <objectType>
        ...
        <attribute>
            <ref>icfs:name</ref>
            <correlator/>
            <inbound>
                <target>
                    <path>name</path>
                </target>
                <use>correlation</use> (1)
            </inbound>
            <outbound> (2)
                ...
            </outbound>
        </attribute>
        ...
    </objectType>
</schemaHandling>
1 | Means that the inbound mapping will be used only for correlation and otherwise won’t be processed |
2 | Represents the outbound mapping as usual |
Advanced Concepts
Multiple Correlation Rules
In more complex deployments, there may be multiple correlation rules. For example, we may want to correlate by given name, family name, date of birth, and national ID using the following rules:
Rule# | Situation | Resulting confidence |
---|---|---|
1 | Family name, date of birth, and national ID exactly match. | 1.0 |
2 | Given name, family name, and date of birth exactly match. | 0.4 |
3 | The national ID exactly matches. | 0.4 |
The confidence values are described on the rule composition page.
These rules can be configured like this:
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>familyName</ref>
                </item>
                <item>
                    <ref>extension/dateOfBirth</ref>
                </item>
                <item>
                    <ref>extension/nationalId</ref>
                </item>
                <composition>
                    <weight>1.0</weight> <!-- this is the default -->
                </composition>
            </items>
            <items>
                <item>
                    <ref>givenName</ref>
                </item>
                <item>
                    <ref>familyName</ref>
                </item>
                <item>
                    <ref>extension/dateOfBirth</ref>
                </item>
                <composition>
                    <weight>0.4</weight>
                </composition>
            </items>
            <items>
                <item>
                    <ref>extension/nationalId</ref>
                </item>
                <composition>
                    <weight>0.4</weight>
                </composition>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
There are a lot of configuration options here.
For example, we can specify the order in which the rules are evaluated and their "A implies B" relations, which ensure the correct computation of confidence when rule A implies rule B.
Please see the rule composition page for more information.
Custom Indexing
This feature is available only when using the native repository implementation.
Sometimes, we need to base the search on specially indexed data. For example, we may need to match only the first five normalized characters of the surname. Or we may want to take only digits into account when searching for the national ID.
These requirements can be configured like this:
<objectTemplate>
    ...
    <item>
        <ref>familyName</ref>
        <indexing>
            <normalization>
                <steps>
                    <polyString> (1)
                        <order>1</order>
                    </polyString>
                    <prefix> (2)
                        <order>2</order>
                        <length>5</length>
                    </prefix>
                </steps>
            </normalization>
        </indexing>
    </item>
    <item>
        <ref>extension/nationalId</ref>
        <indexing>
            <normalization>
                <name>digits</name> (3)
                <steps>
                    <custom>
                        <expression>
                            <script>
                                <code>
                                    basic.stringify(input).replaceAll("[^\\d]", "") (4)
                                </code>
                            </script>
                        </expression>
                    </custom>
                </steps>
            </normalization>
        </indexing>
    </item>
    ...
</objectTemplate>
1 | Applies the default PolyString normalizer to the original value. |
2 | Takes the first 5 characters of the normalized value. |
3 | Name by which this normalization can be referenced. |
4 | Removes everything except for digits. |
These indexes are then used automatically when correlating according to familyName and extension/nationalId, respectively.
If there are multiple normalizations defined for a given focus item (and none is defined as the default one), we can select the one to be used by mentioning it within the correlation item definition:
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>extension/nationalId</ref>
                    <search> (1)
                        <index>digits</index>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
1 | Points to the digits normalization for extension/nationalId property. |
Please see the custom indexing and items correlator pages for more information.
Fuzzy Searching
By default, the searching is done using "exact match" criteria, either on original values or on the ones that underwent the standard or custom normalization. Sometimes, however, we want to search for objects that have a property value somewhat similar to the value we have at hand. For example, we get an account for Jack Sparrow, but besides matching users with surname Sparrow we may want to consider also users Sparow, Sparrou, and so on; although potentially with a lower confidence value.
To support this, midPoint provides fuzzy search logic. There are two methods available:
Method | Description |
---|---|
Levenshtein edit distance | Matches according to the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. (From Wikipedia.) |
Trigram similarity | Matches using the ratio of common trigrams to all trigrams in the compared strings. (See the PostgreSQL documentation.) |
Fuzzy search is available only when using the native repository implementation.
The following example searches for users whose given name and family name are close to the provided ones. The given name must have a Levenshtein edit distance (to the provided one) of at most 3. The family name must have a trigram similarity (to the provided one) of at least 0.8.
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <item>
                    <ref>givenName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>3</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <similarity>
                                <threshold>0.8</threshold>
                            </similarity>
                        </fuzzy>
                    </search>
                </item>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
Please see fuzzy searching page for more information.
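The "lower confidence" idea from the beginning of this section might be expressed by combining a fuzzy search with the rule weight shown in Multiple Correlation Rules. The following is only a sketch assembled from elements that appear elsewhere on this page. The threshold of 2 and the weight of 0.4 are illustrative assumptions, correlating on the family name alone is a simplification, and the way the confidences of the two rules combine is described on the rule composition page.

```xml
<objectTemplate>
    ...
    <correlation>
        <correlators>
            <items>
                <!-- exact match on the family name; the weight defaults to 1.0 -->
                <item>
                    <ref>familyName</ref>
                </item>
            </items>
            <items>
                <!-- fuzzy match: edit distance of at most 2 (assumed value) -->
                <item>
                    <ref>familyName</ref>
                    <search>
                        <fuzzy>
                            <levenshtein>
                                <threshold>2</threshold>
                            </levenshtein>
                        </fuzzy>
                    </search>
                </item>
                <composition>
                    <weight>0.4</weight> <!-- assumed value, lower than for the exact rule -->
                </composition>
            </items>
        </correlators>
    </correlation>
</objectTemplate>
```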
Multiple Identity Data Sources
Advanced correlation needs often go hand in hand with situations where there are multiple sources of identity data. For example, a university may have a Student Information System (SIS) providing data on students and faculty, a Human Resources (HR) system keeping records of all staff (faculty and others), and an "External persons" (EXT) system for maintaining data about visitors and other persons related to the university in a way other than being a student or an employee.
While the data about a person are usually consistent, there may be situations where they differ. For example, the given name may be recorded differently in the SIS and HR systems. Or a title may not have been updated in HR. An old record in the "external persons" system may be out of date altogether.
This situation leads to two kinds of requirements:
- When processing data from these systems, midPoint has to somehow decide which ones are "authoritative", that is, which ones to propagate to the "official" user data stored in the repository.
- When correlating, we may want to match data from all systems for the candidate owners. (Not only the "official" user data.)
MidPoint supports both of them. For the first one, the engineer must provide an algorithm for determination of the authoritative data source. The second one is provided transparently, by indexing the data from all the identity sources.
The following example shows how to configure givenName, familyName, dateOfBirth, and nationalId as "multi-source" properties.
They are kept separately for each source: SIS, HR, and the "external persons" system.
The order of "authoritativeness" (so to say) is: SIS, HR, external, as can be seen in the defaultAuthoritativeSource mapping.
<objectTemplate>
    ...
    <item>
        <ref>givenName</ref>
        <multiSource/> (1)
    </item>
    <item>
        <ref>familyName</ref>
        <multiSource/>
    </item>
    <item>
        <ref>extension/dateOfBirth</ref>
        <multiSource/>
    </item>
    <item>
        <ref>extension/nationalId</ref>
        <multiSource/>
    </item>
    ...
    <multiSource>
        <defaultAuthoritativeSource> (2)
            <expression>
                <script>
                    <code>
                        import com.evolveum.midpoint.util.MiscUtil

                        def RESOURCE_SIS_OID = '...'
                        def RESOURCE_HR_OID = '...'
                        def RESOURCE_EXT_OID = '...'

                        // The order of authoritativeness is: SIS, HR, external
                        if (identity == null) {
                            return null
                        }

                        def sources = identity
                                .collect { it.source }
                                .findAll { it != null }

                        def sis = sources.find { it.resourceRef?.oid == RESOURCE_SIS_OID }
                        def hr = sources.find { it.resourceRef?.oid == RESOURCE_HR_OID }
                        def external = sources.find { it.resourceRef?.oid == RESOURCE_EXT_OID }

                        MiscUtil.getFirstNonNull(sis, hr, external)
                    </code>
                </script>
            </expression>
        </defaultAuthoritativeSource>
    </multiSource>
</objectTemplate>
1 | Marks the property as a "multi-source" one. |
2 | A mapping that selects the most authoritative data source for a given user. |
Please see the page on multiple identity data sources for more information.
Limitations
As a general rule, when a configuration related to smart correlation (including custom indexing or multi-source processing) is placed in an object template, the template must be bound to the resource object type in question via a statically defined archetype (see Listing 3 and 4 in Example 3: Object Template Based Correlation Definition).
Other limitations are mentioned on the pages for individual sub-features.
Compliance
This feature is related to the following compliance frameworks: