Custom Indexing

Last modified 26 Oct 2022 16:11 +02:00

EXPERIMENTAL

This feature is experimental and is not intended for production use. It is unfinished and unstable: the implementation may contain bugs, the configuration may change at any moment without warning, and the feature may not work at all. Use at your own risk. This feature is not covered by midPoint support. If you are interested in supporting the development of this feature, please consider purchasing a midPoint Platform subscription.

Since 4.6

This functionality is available since version 4.6.

Table of Contents

Overview
Implementation
- An Example
Configuration Options
Querying
Maintenance
Limitations
Future Work
See Also

Sometimes, we need to base the search on specially indexed data. For example, we may need to match only the first five normalized characters of the surname. Or we may want to take only the digits into account when searching by national ID. MidPoint supports these requirements using custom indexing.

This feature is available only when using the native repository implementation.

Overview

For each focus object (for example, a user), we have a special searchable container for all data that are indexed in this way. Each time the original data are modified, the content of this container is updated.

This feature can be used to search for:

- data normalized in a custom way, e.g. "take the first five characters of the surname",
- data that are not indexed by default, e.g. the description property,
- data of items for which the multi-source feature is enabled.

Implementation

The container that stores the indexed data is identities/normalizedData. For each indexing (normalization) defined on a given item, it contains a value or values of the given item (or items in the multi-identity case) after the normalization has been applied.

An Example

Table 1. Sample indexing for `givenName`, `familyName`, and `costCenter` properties
#	Item	Name	Description
1	`givenName`	`polyStringNorm`	Default system `PolyString` normalization.
2	`givenName`	`polyStringNorm.prefix3`	First three characters of the default system `PolyString` normalization.
3	`familyName`	`polyStringNorm`	Default system `PolyString` normalization.
4	`costCenter`	`original`	Original value (no normalization).

Listing 1. Defining sample indexing for three properties

<objectTemplate xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
                oid="74a2112a-0ecc-4c09-818a-1d9e234e8e6f">
    <name>person</name>
    <item>
        <ref>givenName</ref>
        <indexing>
            <normalization>
                <default>true</default>
                <steps>
                    <polyString/> <!-- (1) -->
                </steps>
            </normalization>
            <normalization>
                <steps>
                    <polyString> <!-- (2) -->
                        <order>1</order>
                    </polyString>
                    <prefix>
                        <order>2</order>
                        <length>3</length>
                    </prefix>
                </steps>
            </normalization>
        </indexing>
    </item>
    <item>
        <ref>familyName</ref>
        <indexing/> <!-- (3) -->
    </item>
    <item>
        <ref>costCenter</ref>
        <indexing>
            <normalization>
                <steps>
                    <none/> <!-- (4) -->
                </steps>
            </normalization>
        </indexing>
    </item>
</objectTemplate>

1	PolyString normalization is the default and can be omitted. It is shown here only for completeness.
2	Here, however, the `polyString` step must be present. Otherwise, the prefix step would take the first three characters of the original form.
3	This tells midPoint to index `familyName` in the default way (`PolyString` normalization).
4	To preserve the original form, the `none` step must be specified explicitly, as shown here.

The original and normalized values on a real user object can then look like this:

Listing 2. Original and normalized values in the real object

<user>
    ...
    <givenName>Alice</givenName>
    <familyName>Black</familyName>
    <costCenter>CCx-1/100</costCenter>
    ...
    <identities>
        <normalizedData xmlns:gen370="http://midpoint.evolveum.com/xml/ns/public/common/normalized-data-3">
            <gen370:familyName.polyStringNorm xsi:type="xsd:string">black</gen370:familyName.polyStringNorm>
            <gen370:givenName.polyStringNorm xsi:type="xsd:string">alice</gen370:givenName.polyStringNorm>
            <gen370:givenName.polyStringNorm.prefix3 xsi:type="xsd:string">ali</gen370:givenName.polyStringNorm.prefix3>
            <gen370:costCenter.original xsi:type="xsd:string">CCx-1/100</gen370:costCenter.original>
        </normalizedData>
    </identities>
</user>
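
The normalized values in Listing 2 follow from applying the configured steps to the original values. The following Java sketch reproduces them outside midPoint. It is only an approximation: `polyStringNorm` is an assumed stand-in for the system-defined `PolyString` normalization (the real normalizer is configurable and may behave differently), and the class and method names are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NormalizationSketch {

    // Assumed approximation of the default PolyString normalization:
    // decompose (NFKD), keep only ASCII letters, digits, underscores and
    // whitespace, collapse whitespace, and lowercase. Not midPoint's actual code.
    static String polyStringNorm(String value) {
        String s = Normalizer.normalize(value.trim(), Normalizer.Form.NFKD);
        s = s.replaceAll("[^\\w\\s]", "");
        s = s.replaceAll("\\s+", " ").trim();
        return s.toLowerCase(Locale.ROOT);
    }

    // The "prefix" step: keep at most the first `length` characters.
    static String prefix(String value, int length) {
        return value.length() <= length ? value : value.substring(0, length);
    }

    public static void main(String[] args) {
        System.out.println(polyStringNorm("Alice"));            // givenName.polyStringNorm
        System.out.println(prefix(polyStringNorm("Alice"), 3)); // givenName.polyStringNorm.prefix3
        System.out.println(prefix("Alice", 3));                 // prefix without the polyString step
        System.out.println("CCx-1/100");                        // costCenter.original ("none" step)
    }
}
```

Under these assumptions, the first two lines print `alice` and `ali`, matching Listing 2, while the third prints `Ali`. This illustrates why the `polyString` step has to be listed explicitly before `prefix`.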

In the database, the normalized values are stored in a separate JSONB column: m_focus.normalizedData. They are not part of m_object.fullObject.

Configuration Options

Custom indexing is configured in the object template by attaching indexing information to the item element. (It is also turned on by default when the multi-source feature is enabled for the item.)

The following configuration options are available for each item:

Table 2. Configuration options for item indexing
Option	Description	Example
`indexedItemName`	Local item name in the `normalizedData` container. Usually it can be left unspecified, because by default, the item local name is used. (The namespace is always `http://midpoint.evolveum.com/xml/ns/public/common/normalized-data-3`.)	`givenName`
`normalization`	Set of normalizations that are applied to the given item.	Default `PolyString` normalization

Each normalization is configured using these options:

Table 3. Configuration options for item normalization
Option	Description	Example
`name`	Name of the index (normalization). It is appended to the item name. Usually it can be left unspecified, because it is derived from the normalization step(s).	`polyStringNorm`
`default`	Is this the default index (normalization) for the given item? It is necessary to specify it only if there is more than one normalization defined.	`true`
`indexedNormalizedItemName`	Overrides the generated name of the indexed item (original item name + normalization name). Normally not needed.	`givenName.polyStringNorm`
`steps`	Steps used to compute the indexed value. The default is to use the system-defined `PolyString` normalization method.	Use `PolyString` normalization
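
The naming rules above can be sketched in a few lines of Java. This is an illustration inferred from the sample names in Table 1 and Listing 2 (for example, `givenName.polyStringNorm.prefix3`), not midPoint's actual implementation. The class and method names are hypothetical.

```java
import java.util.List;

final class IndexedNameSketch {

    // Default normalization name: the per-step suffixes joined with dots,
    // e.g. ["polyStringNorm", "prefix3"] -> "polyStringNorm.prefix3".
    static String normalizationName(List<String> stepSuffixes) {
        return String.join(".", stepSuffixes);
    }

    // Indexed item name + "." + normalization name, as seen in Listing 2.
    static String indexedNormalizedItemName(String indexedItemName, List<String> stepSuffixes) {
        return indexedItemName + "." + normalizationName(stepSuffixes);
    }

    public static void main(String[] args) {
        System.out.println(indexedNormalizedItemName("givenName", List.of("polyStringNorm", "prefix3")));
        System.out.println(indexedNormalizedItemName("costCenter", List.of("original")));
    }
}
```

Setting `name` replaces the derived normalization name, and `indexedNormalizedItemName` replaces the whole generated name.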

There are the following types of normalization steps:

Table 4. Types of normalization steps
Type	Description	Default normalized item name suffix
`none`	Does no normalization, i.e., keeps the original value intact.	`.original`
`polyString`	Applies system-defined or custom `PolyString` normalization.	`.polyStringNorm`
`prefix`	Takes the first `N` characters of the value.	`.prefixN`
`custom`	Applies a custom normalization expression (e.g., a Groovy script) to the value.	`.custom` ^[1]

Each normalization step has the following options:

Table 5. Configuration options for a normalization step
Option	Applies to	Description
`order`	all steps	Order in which the step is applied. It should be specified whenever there is more than one step, because current prism structures (containers) are not guaranteed to preserve the order of their values. Steps without an `order` value are applied last.
`documentation`	all steps	Technical documentation for the step.
`configuration`	`polyString`	Configuration of `PolyString` normalizer. If not specified, the one defined at the system level is used.
`length`	`prefix`	How many characters to keep.
`expression`	`custom`	Expression that transforms the value to its normalized form. The original value is available as `input`.
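
To illustrate why `order` matters, the following Java sketch applies a list of steps sorted by their order, with unordered steps going last. It is a simplified model under stated assumptions, not midPoint code: the `Step` class is hypothetical, and the digits-only lambda merely stands in for what a `custom` expression might compute for a national ID.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.UnaryOperator;

final class StepOrderSketch {

    // Hypothetical model of one normalization step: optional order + transformation.
    static final class Step {
        final Integer order; // null when no order is configured
        final UnaryOperator<String> transform;

        Step(Integer order, UnaryOperator<String> transform) {
            this.order = order;
            this.transform = transform;
        }
    }

    // Applies the steps by ascending order; steps without an order go last.
    static String normalize(String input, List<Step> steps) {
        List<Step> sorted = new ArrayList<>(steps);
        sorted.sort(Comparator.comparing((Step s) -> s.order,
                Comparator.nullsLast(Comparator.naturalOrder())));
        String value = input;
        for (Step step : sorted) {
            value = step.transform.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Declared in the "wrong" sequence on purpose: the order values decide.
        List<Step> steps = List.of(
                new Step(2, v -> v.length() <= 4 ? v : v.substring(0, 4)), // prefix-like step
                new Step(1, v -> v.replaceAll("[^0-9]", "")));             // custom-like step: digits only
        System.out.println(normalize("85-01/23 456", steps));
    }
}
```

With the digits-only step applied first, the example yields `8501`. If the prefix were applied first, the result would be `850`, so relying on the declaration sequence instead of explicit `order` values could silently change the indexed data.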

Querying

The values are queried just like any others. The only difference is that their definitions are dynamic, so the definition must be constructed manually, e.g. when building the query in Java.

Listing 3. An example normalized (indexed) item query - in Java

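import com.evolveum.midpoint.prism.PrismContext;
import com.evolveum.midpoint.prism.path.ItemName;
import com.evolveum.midpoint.prism.query.ObjectQuery;
import com.evolveum.midpoint.schema.constants.SchemaConstants;
import com.evolveum.midpoint.util.DOMUtil;
import com.evolveum.midpoint.xml.ns._public.common.common_3.FocusIdentitiesType;
import com.evolveum.midpoint.xml.ns._public.common.common_3.UserType;

// The definition of the normalized item is dynamic, so it is built by hand.
ItemName itemName = new ItemName(SchemaConstants.NS_NORMALIZED_DATA, "familyName.polyStringNorm");
var def = PrismContext.get().definitionFactory()
        .createPropertyDefinition(itemName, DOMUtil.XSD_STRING, null, null);

// Match users whose identities/normalizedData/familyName.polyStringNorm equals "green".
ObjectQuery query = PrismContext.get().queryFor(UserType.class)
        .itemWithDef(def,
                UserType.F_IDENTITIES,
                FocusIdentitiesType.F_NORMALIZED_DATA,
                itemName)
        .eq("green")
        .build();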

In the future, it should also be possible to specify such queries in the Axiom query language or in XML/JSON/YAML. However, some issues need to be resolved first:

- The definitions of normalized data are dynamic. Hence, such a query cannot be interpreted without knowing the archetype/object template of the objects in question. (This is very similar to searching by shadow attribute values, whose definitions are specified by the resource object type.) Therefore, such a query should always be interpreted within the scope of an archetype.
- In 4.6, Axiom has issues with dots in names, and dots are used in normalized item names.

Listing 4. An example normalized (indexed) item Axiom query - not working now, so provided for illustration purposes only

identities/normalizedData/familyName.polyStringNorm = "green"

Maintenance

The normalized data are maintained automatically by midPoint.

In the current implementation, the model subsystem takes care of it. This means that a careless "raw" update may break the consistency of the indexed data.

If this happens, or if the indexing definition changes, the administrator should execute any regular operation on the affected objects to bring the data back into sync. An example of such an operation is focus object recomputation.

We should consider finding (or creating) a special partial processing option that would do just this update without the overhead of the full recomputation.

Limitations

- This feature is available on the native repository only.
- Only string and PolyString values are currently indexable.
- One must be careful when editing the data in "raw" mode and when changing the indexing definition; see the Maintenance section.
- The object template must be declared in the "new style" using an archetype (i.e., not in the "legacy" way in the system configuration).

Future Work

In 4.6, this feature is used in the context of correlation only. However, in theory, nothing precludes its use in more general scenarios. One of them could be, for example, searching for users directly in the user list in the GUI.