<schemaHandling>
<objectType>
...
<attribute>
<ref>icfs:name</ref>
<correlator/> (2)
<inbound> (1)
<target>
<path>name</path>
</target>
</inbound>
</attribute>
...
</objectType>
</schemaHandling>
Smart Correlation
feature
This page describes midPoint feature.
Please see the feature page for more details.
|
Since 4.6
This functionality is available since version 4.6.
|
Introduction
The "smart correlation" is a mechanism to correlate identity data to existing focus objects in the repository. Typical use is e.g. the resource objects correlation during the synchronization process, where (newly discovered) accounts on a resource are synchronized to midPoint. Another use will be the correlation during manual or automated registration of new users, including self-registration.
In midPoint 4.4 and before, the only way of correlation was the use of correlation filters, with strictly binary output: either a matching object was found, or there was no match.
In midPoint 4.5, we introduced manual correlation for situations where there is a candidate match (or more candidate matches) that need to be resolved by the human operator. It worked with the external ID Match API service.
The goal for midPoint 4.6 and beyond is to provide a configurable correlation mechanism that can provide approximate matching. For short, it is called smart correlation.
Configuration
In 4.6, the correlation mechanism is based on correlation rules, technically called correlators. For example, a rule can state that "if the family name, date of birth, and the national-wide ID all match, then the identity is the same". Another rule can state that "if (only) the national-wide ID matches, then the identity is the same with the confidence level of 0.7" (whatever the number means).
In the future, we plan to provide AI-assisted correlation that will suggest correlation candidates also according to human resolution of disputed correlation situations in the history. At that time, the correlation rules will be not the only - or even not the primary - source for correlation suggestions. But, in 4.6, they are the only driver of the correlation algorithm. |
Correlation Rule Types
There are the following types of correlation rules:
Type | Meaning | Available since |
---|---|---|
|
Smart item-based correlation rule. The suggested one. |
4.6 |
|
Legacy filter-based correlation rule. |
1.7 |
|
Experimental rule, based on an evaluation of a custom expression. |
4.5 |
|
Rule that uses an external ID Match service. Please see this document for more information. |
4.5 |
Precisely speaking, there is also a composite rule that provides an aggregation of the results of its children.
However, in 4.6 it is supported only as a top-level rule, i.e., it is present automatically - without the possibility (nor need) to specify it explicitly.
|
Correlation Configuration Placement
The correlation configuration can reside in the following places:
-
A resource object type definition: either in top-level
correlation
item, or distributed into individual attribute definitions. -
An object template, currently in top-level
correlation
item. [1]
The reason for such flexibility is that in some scenarios, the correlation is bound to given type of focus objects, regardless of the origin of identity data we need to correlate. They can come from any resource or (in the future) they may come from registration or self-registration processes. In other scenarios, though, the correlation rules are specific to given resource object type.
When present, the configuration attached to the resource object type takes precedence over the one connected to the object template.
The configuration attached to the object template requires the use of archetypes. See Limitations. |
Configuration Examples
Example 1: Attribute-Bound Definition
The following is the most basic example: an attribute is mapped to a focus property that serves as a correlation item.
icfs:name
serving as a correlation attribute1 | Means that the icfs:name attribute is mapped to the name focus property. |
2 | Means that the account is correlated to the focus objects by searching for the corresponding value of name property. |
Notes:
-
The correlation takes place before the regular inbound mappings are evaluated. That is why - starting with 4.5 - there is a special inbound mapping evaluation mode: correlation-time evaluation. Even though it is turned off by default, the attribute-level
correlator
element automatically turns it on for the selected inbound mapping. -
If multiple attributes are marked as
correlator
, they are evaluated separately. If there is a need of evaluation of two attributes together (meaning that they have to match both), it is necessary to use the explicititems
correlator.
Example 2: Resource Object Type Bound Definition
Here we show the same logic defined at the level of the resource object type:
icfs:name
serving as a correlation attribute (defined at the level of resource object type)<schemaHandling>
<objectType>
...
<attribute>
<ref>icfs:name</ref>
<inbound>
<target>
<path>name</path>
</target>
</inbound>
</attribute>
...
<correlation>
<correlators>
<items>
<item>
<ref>name</ref> (1)
</item>
</items>
</correlators>
</correlation>
...
</objectType>
</schemaHandling>
1 | Declaring the name to be the correlation item. |
Like we have seen in Example 1, mentioning the item name as a correlation item enables the correlation-time inbound processing for it.
|
Example 3: Object Template Based Correlation Definition
Finally, this is how the correlation can be defined at the level of an object template. Here we show a rule requiring that both given name and family name match.
<objectTemplate oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d" xmlns="...">
...
<correlation>
<correlators>
<items>
<item>
<ref>givenName</ref>
</item>
<item>
<ref>familyName</ref>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
The correlation-time inbound processing is automatically enabled also in this case. The object template must be connected to the resource object type via the archetype declared in the object type definition.[2] An example:
<resource oid="..." xmlns="...">
...
<schemaHandling>
<objectType>
...
<focus>
<type>UserType</type>
<archetypeRef oid="36d04df1-8f81-4442-b576-97b54c716245" />
</focus>
...
</objectType>
</schemaHandling>
</resource>
<archetype oid="36d04df1-8f81-4442-b576-97b54c716245" xmlns="...">
...
<archetypePolicy>
<objectTemplateRef oid="6eb46cb4-d707-4d91-a4ae-1a081bcfe16d"/>
</archetypePolicy>
...
</archetype>
Advanced Concepts
Multiple Correlation Rules
In more complex deployments, there may be multiple correlation rules. For example, we may want to correlate by given name, family name, date of birth, and national ID using the following rules:
Rule# | Situation | Resulting confidence |
---|---|---|
1 |
Family name, date of birth, and national ID exactly match. |
1.0 |
2 |
Given name, family name, and date of birth exactly match. |
0.4 |
3 |
The national ID exactly matches. |
0.4 |
The confidence values are described on rule composition page. |
These rules can be configured like this:
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>familyName</ref>
</item>
<item>
<ref>extension/dateOfBirth</ref>
</item>
<item>
<ref>extension/nationalId</ref>
</item>
<composition>
<weight>1.0</weight> <!-- this is the default -->
</composition>
</items>
<items>
<item>
<ref>givenName</ref>
</item>
<item>
<ref>familyName</ref>
</item>
<item>
<ref>extension/dateOfBirth</ref>
</item>
<composition>
<weight>0.4</weight>
</composition>
</items>
<items>
<item>
<ref>extension/nationalId</ref>
</item>
<composition>
<weight>0.4</weight>
</composition>
</items>
</correlators>
</correlation>
</objectTemplate>
There are a lot of configuration options here.
For example, we can specify the order of rules evaluation and their "A implies B" relations that ensure the correct computation of confidence in case of rule A
implying rule B
.
Please see rule composition page for more information.
Custom Indexing
This feature is available only when using the native repository implementation. |
Sometimes, we need to base the search on specially-indexed data. For example, we could need to match only first five normalized characters of the surname. Or, we could want to take only digits into account when searching for the national ID.
These requirements can be configured like this:
<objectTemplate>
...
<item>
<ref>familyName</ref>
<indexing>
<normalization>
<steps>
<polyString> (1)
<order>1</order>
</polyString>
<prefix> (2)
<order>2</order>
<length>5</length>
</prefix>
</steps>
</normalization>
</indexing>
</item>
<item>
<ref>extension/nationalId</ref>
<indexing>
<normalization>
<name>digits</name> (3)
<steps>
<custom>
<expression>
<script>
<code>
basic.stringify(input).replaceAll("[^\\d]", "") (4)
</code>
</script>
</expression>
</custom>
</steps>
</normalization>
</indexing>
</item>
...
</objectTemplate>
1 | Applies the default PolyString normalizer to the original value. |
2 | Takes the first 5 characters of the normalized value. |
3 | Name by which this normalization can be referenced. |
4 | Removes everything except for digits. |
These indexes are then used automatically when correlating according to familyName
and extension/nationalId
, respectively.
If there are multiple normalizations defined for a given focus item (and none is defined as the default one), we can select the one to be used by mentioning it within the correlation item definition:
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>extension/nationalId</ref>
<search> (1)
<index>digits</index>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
1 | Points to the digits normalization for extension/nationalId property. |
Please see custom indexing and items
correlator for more information.
Fuzzy Searching
By default, the searching is done using "exact match" criteria, either on original values or on the ones that underwent the standard or custom normalization. Sometimes, however, we want to search for objects that have a property value somewhat similar to the value we have at hand. For example, we get an account for Jack Sparrow, but besides matching users with surname Sparrow we may want to consider also users Sparow, Sparrou, and so on; although potentially with a lower confidence value.
To do this, a fuzzy search logic was implemented. There are two methods available:
Method | Description |
---|---|
Levenshtein edit distance |
Matches according to the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. (From wikipedia.) |
Trigram similarity |
Matches using the ratio of common trigrams to all trigrams in compared strings.
(See PostgreSQL documentation on |
The fuzzy search is available only when using the native repository implementation. |
An example that searches for users having given name and family name close to the provided ones. The given name has to have Levenshtein edit distance (to the provided one) at most 3. The family name has to have trigram similarity (to the provided one) at least 0.8.
<objectTemplate>
...
<correlation>
<correlators>
<items>
<item>
<ref>givenName</ref>
<search>
<fuzzy>
<levenshtein>
<threshold>3</threshold>
</levenshtein>
</fuzzy>
</search>
</item>
<item>
<ref>familyName</ref>
<search>
<fuzzy>
<similarity>
<threshold>0.8</threshold>
</similarity>
</fuzzy>
</search>
</item>
</items>
</correlators>
</correlation>
</objectTemplate>
Please see fuzzy searching page for more information.
Multiple Identity Data Sources
The advanced correlation needs often go hand in hand with the situations when there are multiple sources of the identity data. For example, a university may have its Student Information System (SIS) providing data on students and faculty, Human Resources (HR) System keeping records of all staff - faculty and others, and "External persons" (EXT) system for maintaining data about visitors and other persons related to the university in a way other than being a student or employee.
While the data about a person are usually consistent, there may be situations when they differ. For example, the given name may be recorded differently in SIS and HR systems. Or the title may be forgotten to be updated in HR. An old record in the "external persons" system may be out-of-date altogether.
This situation leads to two kinds of requirements:
-
When processing data from these systems, midPoint has to somehow decide which ones are "authoritative", that is, which ones to propagate to the "official" user data stored in the repository.
-
When correlating, we may want to match data from all systems for the candidate owners. (Not only the "official" user data.)
MidPoint supports both of them. For the first one, the engineer must provide an algorithm for determination of the authoritative data source. The second one is provided transparently, by indexing the data from all the identity sources.
The following example shows how to configure givenName
, familyName
, dateOfBirth
, and nationalId
as "multi-source" properties.
They are kept separately for each source: SIS, HR, and "external persons" system.
The order of "authoritativeness" (so to say) is: SIS, HR, external, as can be seen in the defaultAuthoritativeSource
mapping.
<objectTemplate>
...
<item>
<ref>givenName</ref>
<multiSource/> (1)
</item>
<item>
<ref>familyName</ref>
<multiSource/>
</item>
<item>
<ref>extension/dateOfBirth</ref>
<multiSource/>
</item>
<item>
<ref>extension/nationalId</ref>
<multiSource/>
</item>
...
<multiSource>
<defaultAuthoritativeSource> (2)
<expression>
<script>
<code>
import com.evolveum.midpoint.util.MiscUtil
def RESOURCE_SIS_OID = '...'
def RESOURCE_HR_OID = '...'
def RESOURCE_EXT_OID = '...'
// The order of authoritativeness is: SIS, HR, external
if (identity == null) {
return null
}
def sources = identity
.collect { it.source }
.findAll { it != null }
def sis = sources.find { it.resourceRef?.oid == RESOURCE_SIS_OID }
def hr = sources.find { it.resourceRef?.oid == RESOURCE_HR_OID }
def external = sources.find { it.resourceRef?.oid == RESOURCE_EXT_OID }
MiscUtil.getFirstNonNull(sis, hr, external)
</code>
</script>
</expression>
</defaultAuthoritativeSource>
</multiSource>
</objectTemplate>
1 | Marks a property to be "multi-source" one. |
2 | A mapping that selects the most authoritative data source for a given user. |
Please see the page on multiple identity data sources for more information.
Limitations
As a general rule, when referencing a configuration related to smart correlation (including custom indexing or multi-source processing) in an object template, it must be bound to the resource object type in question via statically-defined archetype (see Listing 3 and 4 in Example 3: Object Template Based Correlation Definition).
Other limitations are mentioned on pages for individual sub-features: