Custom Indexing

Last modified 14 Sep 2022 11:49 +02:00
EXPERIMENTAL
This feature is experimental and is not intended for production use. It is neither finished nor stable: the implementation may contain bugs, the configuration may change at any moment without warning, and the feature may not work at all. Use it at your own risk. This feature is not covered by midPoint support. If you are interested in supporting the development of this feature, please consider purchasing a midPoint Platform subscription.
Since 4.6
This functionality is available since version 4.6.

Sometimes, we need to base the search on specially indexed data. For example, we may need to match only the first five normalized characters of the surname. Or we may want to take only digits into account when searching by national ID. MidPoint supports these requirements using custom indexing.

This feature is available only when using the native repository implementation.

TODO Decide on the "experimental" vs "production-ready" status of this work. The implementation itself should be reliable enough. However, there are some open questions regarding the configuration structures.

Overview

For each focus object (for example, a user), we have a special searchable container for all data that are indexed in this way. Each time the original data are modified, the content of this container is updated.

This feature can be used to search for:

  1. data normalized in a custom way, e.g., "take the first five characters of the surname",

  2. data that are not indexed by default, e.g., the description property,

  3. data of items for which the multi-source feature is enabled.

Implementation

The container that stores the indexed data is identities/normalizedData. For each indexing (normalization) defined on a given item, it contains a value or values of the given item (or items in the multi-identity case) after the normalization has been applied.

An Example

Table 1. Sample indexing for givenName, familyName, and costCenter properties
| # | Item | Name | Description |
|---|------|------|-------------|
| 1 | givenName | polyStringNorm | Default system PolyString normalization. |
| 2 | givenName | polyStringNorm.prefix3 | First three characters of the default system PolyString normalization. |
| 3 | familyName | polyStringNorm | Default system PolyString normalization. |
| 4 | costCenter | original | Original value (no normalization). |

Listing 1. Defining sample indexing for three properties
<objectTemplate xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
                oid="74a2112a-0ecc-4c09-818a-1d9e234e8e6f">
    <name>person</name>
    <item>
        <ref>givenName</ref>
        <indexing>
            <normalization>
                <default>true</default>
                <steps>
                    <polyString/> <!-- (1) -->
                </steps>
            </normalization>
            <normalization>
                <steps>
                    <polyString> <!-- (2) -->
                        <order>1</order>
                    </polyString>
                    <prefix>
                        <order>2</order>
                        <length>3</length>
                    </prefix>
                </steps>
            </normalization>
        </indexing>
    </item>
    <item>
        <ref>familyName</ref>
        <indexing/> <!-- (3) -->
    </item>
    <item>
        <ref>costCenter</ref>
        <indexing>
            <normalization>
                <steps>
                    <none/> <!-- (4) -->
                </steps>
            </normalization>
        </indexing>
    </item>
</objectTemplate>
(1) PolyString normalization is the default step and can be omitted. It is shown here just for completeness.
(2) Here, however, the polyString step must be present. Otherwise, the first three characters would be taken from the original (non-normalized) value.
(3) This tells midPoint to index familyName in the default way (PolyString normalization).
(4) To preserve the original form of the value, the none step must be specified explicitly, as shown here.

The original and normalized values on a real user object can then look like this:

Listing 2. Original and normalized values in the real object
<user>
    ...
    <givenName>Alice</givenName>
    <familyName>Black</familyName>
    <costCenter>CCx-1/100</costCenter>
    ...
    <identities>
        <normalizedData xmlns:gen370="http://midpoint.evolveum.com/xml/ns/public/common/normalized-data-3">
            <gen370:familyName.polyStringNorm xsi:type="xsd:string">black</gen370:familyName.polyStringNorm>
            <gen370:givenName.polyStringNorm xsi:type="xsd:string">alice</gen370:givenName.polyStringNorm>
            <gen370:givenName.polyStringNorm.prefix3 xsi:type="xsd:string">ali</gen370:givenName.polyStringNorm.prefix3>
            <gen370:costCenter.original xsi:type="xsd:string">CCx-1/100</gen370:costCenter.original>
        </normalizedData>
    </identities>
</user>

In the database, the normalized values are stored in a separate JSONB column: m_focus.normalizedData. They are not part of m_object.fullObject.

Configuration Options

Custom indexing is configured in the object template by attaching indexing information to the item element. (It is also turned on by default when the multi-source feature is enabled for the item.)

The following configuration options are available for each item:

Table 2. Configuration options for item indexing
| Option | Description | Example |
|--------|-------------|---------|
| indexedItemName | Local item name in the normalizedData container. Usually it can be left unspecified, because by default, the item local name is used. (The namespace is always http://midpoint.evolveum.com/xml/ns/public/common/normalized-data-3.) | givenName |
| normalization | Set of normalizations that are applied to the given item. | Default PolyString normalization |

Each normalization is configured using these options:

Table 3. Configuration options for item normalization
| Option | Description | Example |
|--------|-------------|---------|
| name | Name of the index (normalization). It is appended to the item name. Usually it can be left unspecified, because it is derived from the normalization step(s). | polyStringNorm |
| default | Is this the default index (normalization) for the given item? It needs to be specified only if more than one normalization is defined. | true |
| indexedNormalizedItemName | Overrides the generated name for the indexed item (original item name + normalization name). Normally, this should not be needed. | givenName.polyStringNorm |
| steps | How is the indexed value computed? The default is to use the system-defined PolyString normalization method. | Use PolyString normalization |
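
As a rough illustration of how the options from Tables 2 and 3 fit together, the following fragment of an object template overrides both the indexed item name and the normalization name. The names surname and short are made up for this example and do not come from the midPoint distribution. Going by the naming rule described above (item name + normalization name), the indexed value should then appear as surname.short in the normalizedData container.

```xml
<!-- Fragment of an objectTemplate; "surname" and "short" are hypothetical names. -->
<item>
    <ref>familyName</ref>
    <indexing>
        <!-- Overrides the default local item name (familyName). -->
        <indexedItemName>surname</indexedItemName>
        <normalization>
            <!-- Overrides the name otherwise derived from the steps. -->
            <name>short</name>
            <steps>
                <polyString>
                    <order>1</order>
                </polyString>
                <prefix>
                    <order>2</order>
                    <length>5</length>
                </prefix>
            </steps>
        </normalization>
    </indexing>
</item>
```

Because only one normalization is defined here, the default option does not need to be set.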

The following types of normalization steps are available:

Table 4. Types of normalization steps
| Type | Description | Default normalized item name suffix |
|------|-------------|-------------------------------------|
| none | Does no normalization, i.e., keeps the original value intact. | .original |
| polyString | Applies system-defined or custom PolyString normalization. | .polyStringNorm |
| prefix | Takes first N characters of the value. | prefixN |
| custom | Applies a custom normalization expression (e.g., a Groovy script) to the value. | custom [1] |

Each normalization step has the following options:

Table 5. Configuration options for a normalization step
| Option | Applies to | Description |
|--------|------------|-------------|
| order | all steps | Order in which the step is to be applied. It should be specified if there is more than one step, because current prism structures (containers) are not guaranteed to preserve the order of their values. Steps without an order value go last. |
| documentation | all steps | Technical documentation for the step. |
| configuration | polyString | Configuration of the PolyString normalizer. If not specified, the one defined at the system level is used. |
| length | prefix | How many characters to keep. |
| expression | custom | Expression that transforms the value to its normalized form. It receives the original value in the input variable. |
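
The custom step can be pictured using the national ID case mentioned in the introduction. The fragment below is only a sketch under stated assumptions: nationalId is a hypothetical extension property, digits is a made-up normalization name (an explicit name is advised for custom steps), and the Groovy one-liner assumes that the original value arrives in the input variable, as described in Table 5.

```xml
<!-- Fragment of an objectTemplate; extension/nationalId is a hypothetical item. -->
<item>
    <ref>extension/nationalId</ref>
    <indexing>
        <normalization>
            <!-- Explicit name, as advised for custom steps. -->
            <name>digits</name>
            <steps>
                <custom>
                    <documentation>Keeps only the digits of the value.</documentation>
                    <expression>
                        <script>
                            <!-- Groovy: convert the input to a string and drop all non-digit characters. -->
                            <code>basic.stringify(input).replaceAll('[^0-9]', '')</code>
                        </script>
                    </expression>
                </custom>
            </steps>
        </normalization>
    </indexing>
</item>
```

With such a definition, a value like 123-456/789 would be indexed as 123456789, presumably under the name nationalId.digits.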

Querying

Normalized values are queried just like any other items. The only difference is that their definitions are dynamic, so in Java, for example, the item definition must be constructed manually.

Listing 3. An example normalized (indexed) item query - in Java
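import com.evolveum.midpoint.prism.PrismContext;
import com.evolveum.midpoint.prism.path.ItemName;
import com.evolveum.midpoint.prism.query.ObjectQuery;
import com.evolveum.midpoint.schema.constants.SchemaConstants;
import com.evolveum.midpoint.util.DOMUtil;
import com.evolveum.midpoint.xml.ns._public.common.common_3.FocusIdentitiesType;
import com.evolveum.midpoint.xml.ns._public.common.common_3.UserType;

// The definition of the normalized item is dynamic, so it is created by hand.
ItemName itemName = new ItemName(SchemaConstants.NS_NORMALIZED_DATA, "familyName.polyStringNorm");
var def = PrismContext.get().definitionFactory()
        .createPropertyDefinition(itemName, DOMUtil.XSD_STRING, null, null);

// Match users whose normalized family name equals "green".
ObjectQuery query = PrismContext.get().queryFor(UserType.class)
        .itemWithDef(def,
                UserType.F_IDENTITIES,
                FocusIdentitiesType.F_NORMALIZED_DATA,
                itemName)
        .eq("green")
        .build();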

In the future, it should also be possible to specify such queries in the Axiom query language or in XML/JSON/YAML. However, some issues have to be resolved first:

  1. The definitions of normalized data are dynamic. Hence, such a query is not interpretable without knowing the archetype/object template of the objects in question. (It is very similar to searching by shadow attribute values; their definition is specified by resource object type.) Therefore, such a query should be always interpreted within the scope of an archetype.

  2. In 4.6, Axiom has issues with dots in item names, and normalized item names contain dots.

Listing 4. An example normalized (indexed) item Axiom query - not working now, so provided for illustration purposes only
identities/normalizedData/familyName.polyStringNorm = "green"

Maintenance

The normalized data are maintained automatically by midPoint.

In the current implementation, it is the model subsystem that takes care of this. This means that a careless "raw" update may break the consistency of the indexed data.

If this happens, or if the definition of the indexing changes, the administrator should execute any regular operation on the affected objects to bring the data back into sync. An example of such an operation is focus object recomputation.
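
One way to run such a recomputation in bulk is a recomputation task. The following is a minimal sketch that assumes the activity-based task format of recent midPoint versions. The task name is made up, and the owner reference assumes the default administrator user; adjust both, and narrow the object set, to fit the deployment.

```xml
<task xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3">
    <!-- Hypothetical task name. -->
    <name>Recompute users to refresh normalized data</name>
    <!-- Assumes the default administrator user as the task owner. -->
    <ownerRef oid="00000000-0000-0000-0000-000000000002" type="UserType"/>
    <executionState>runnable</executionState>
    <activity>
        <work>
            <recomputation>
                <objects>
                    <!-- Recomputes all users; a query could narrow this down. -->
                    <type>UserType</type>
                </objects>
            </recomputation>
        </work>
    </activity>
</task>
```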

We should consider finding (or creating) a special partial processing option that would do just this update without the overhead of the full recomputation.

Limitations

  1. This feature is available on the native repository only.

  2. Only string and PolyString values are currently indexable.

  3. One must be careful when editing the data in "raw" mode and when changing the indexing definition; see the Maintenance section.

  4. The object template must be declared in the "new style", using an archetype (i.e., not in the "legacy" way in the system configuration).
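
To illustrate the last limitation, an archetype can point to the object template through its archetype policy. This is a minimal sketch only: the archetype OID and name are hypothetical, while the template OID is the one used in Listing 1.

```xml
<!-- The archetype OID and name are made up for this illustration. -->
<archetype xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
           oid="3a6f3acd-9f85-4f33-a1b5-000000000001">
    <name>person</name>
    <archetypePolicy>
        <!-- References the "person" object template from Listing 1. -->
        <objectTemplateRef oid="74a2112a-0ecc-4c09-818a-1d9e234e8e6f" type="ObjectTemplateType"/>
    </archetypePolicy>
</archetype>
```

Users then need to have this archetype assigned for the template, and thus the indexing definitions, to apply to them.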

Future Work

In 4.6, this feature is used only in the context of correlation. However, in theory, nothing precludes its use in more general scenarios. One of them could be, for example, searching for users directly in the user list in the GUI.


[1] It is advised to provide a specific name.