PolyString Normalization Configuration

Last modified 01 Jul 2024 17:57 +02:00

Introduction

PolyString is a flexible data structure that serves several purposes. One of them is to allow comfortable text search in an international environment. The basic idea is that if a user searches for naiveboy123, then the users naïveBOY123 and NaiveBoy123 are found as well. The other effect is that once someone has registered the username naiveboy123, no other user can register the usernames naïveBOY123, NaiveBoy123 or NAIVEboy123.

This functionality is sometimes achieved by using the native full-text search capabilities of the database. However, those database features are often difficult to configure and may also be somewhat expensive. Most importantly, they are specific to individual databases and there is no practical standard. As midPoint supports several databases, it would be very difficult to support those capabilities in all of them. Therefore midPoint uses a much simpler approach.

PolyString stores the value in two different forms:

Original form (orig): the text that was entered by the user. This may contain international characters, any amount of whitespace and so on. E.g. Coup D’état.
Normalized form (norm): the text that was simplified, cleaned up, transformed to canonical form or otherwise prepared for storage. E.g. coup detat.

Both orig and norm forms are stored in the database. The orig form is used for the vast majority of purposes: displaying the value, editing the value and so on. However, when it comes to searching or uniqueness checks, the norm value is used. Searching works like this:

A careless user enters Coup d`etat into the search input field.
MidPoint normalizes the value. The result is coup detat.
MidPoint looks in the database for all entries whose norm value is equal to coup detat.
An entry is found. The orig part of the matching entry is displayed: Coup D’état.
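
The search steps above can be sketched in a few lines of Python. This is only an illustration of the orig/norm idea, not midPoint code: the `PolyString` class, the `normalize`, `search` and `is_taken` functions and the in-memory `repository` list are all hypothetical names, and the normalizer is a simplified stand-in for the default one.

```python
import re
import unicodedata
from dataclasses import dataclass


def normalize(text: str) -> str:
    """Simplified stand-in for the default normalizer (an assumption, not midPoint code)."""
    text = unicodedata.normalize("NFKD", text.strip())
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # drop accents, punctuation, non-Latin letters
    text = re.sub(r"\s+", " ", text).strip()    # collapse runs of whitespace
    return text.lower()


@dataclass(frozen=True)
class PolyString:
    orig: str  # the text as entered by the user
    norm: str  # the form compared by searches and uniqueness checks

    @classmethod
    def of(cls, orig: str) -> "PolyString":
        return cls(orig=orig, norm=normalize(orig))


def search(entries: list[PolyString], query: str) -> list[str]:
    """Normalize the query, compare against stored norm values, return orig values."""
    wanted = normalize(query)
    return [entry.orig for entry in entries if entry.norm == wanted]


def is_taken(entries: list[PolyString], candidate: str) -> bool:
    """Uniqueness check: a name is taken if its norm form is already stored."""
    wanted = normalize(candidate)
    return any(entry.norm == wanted for entry in entries)


# Hypothetical in-memory stand-in for the repository.
repository = [PolyString.of("Coup D’état"), PolyString.of("naiveboy123")]

print(search(repository, "Coup d`etat"))    # the stored entry is found
print(search(repository, "coup d etat"))    # extra space inside the word: no match
print(is_taken(repository, "NAIVEboy123"))  # differs only in case: already taken
```

Because the comparison is a plain equality test on the norm value, the database only needs an ordinary indexed column and no database-specific full-text features.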

Of course, this capability is a bit limited. Searching for "coup d etat" is unlikely to provide any meaningful results. There is neither equivalence matching nor stemming, so it makes no sense to search for "putsch". Yet, this is a nice, simple and elegant method for many practical use cases.

Normalizers

The effectiveness of the method depends heavily on the normalization algorithm. Given the right algorithm, the system will work flawlessly. However, a wrong normalization algorithm may cause a lot of problems. Therefore, the normalization algorithm is configurable, and there is an option for a completely custom normalization algorithm.

All normalization algorithms bundled with midPoint go through the same set of steps:

Trimming (trim): removing whitespace at the start and the end of the string.
Decomposition (nfkd): composed characters (such as é) are decomposed into their constituent parts (e and '). Unicode Normalization Form Compatibility Decomposition (NFKD) is used for this purpose.
Core normalization algorithm: e.g. removing all non-alphanumeric characters, removing all non-ASCII characters and so on.
Whitespace trimming (trimWhitespace): replacing each run of extra whitespace with a single space. E.g. Good   morning    Sunshine is transformed to Good morning Sunshine.
Lowercase transform (lowercase): all characters are transformed to their lowercase equivalents.
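
The sequence of steps can be sketched as a small pipeline. This is a minimal Python illustration under stated assumptions, not the midPoint implementation: `NormalizerConfig` and `run_pipeline` are hypothetical names, the flags simply mirror the step names in parentheses above, and the whitespace step is assumed to also strip leading and trailing spaces left behind by the core step.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class NormalizerConfig:
    # Hypothetical flags mirroring the step names; every step is on by default.
    trim: bool = True
    nfkd: bool = True
    trim_whitespace: bool = True
    lowercase: bool = True


def run_pipeline(text: str, core: Callable[[str], str],
                 config: NormalizerConfig = NormalizerConfig()) -> str:
    """Apply the steps in order, skipping any step that is switched off."""
    if config.trim:
        text = text.strip()
    if config.nfkd:
        text = unicodedata.normalize("NFKD", text)
    text = core(text)  # the core algorithm always runs
    if config.trim_whitespace:
        # Assumption: collapse runs of whitespace and strip what the core step left over.
        text = re.sub(r"\s+", " ", text).strip()
    if config.lowercase:
        text = text.lower()
    return text


def keep_all(text: str) -> str:
    """Trivial core step that removes nothing."""
    return text


sample = "  Good   morning    Sunshine "
print(run_pipeline(sample, keep_all))
print(run_pipeline(sample, keep_all, NormalizerConfig(lowercase=False)))
```

With all steps enabled the sample becomes `good morning sunshine`; with the lowercase step disabled the capital letters survive.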

All these steps are applied by default. However, individual steps can be disabled in the normalizer configuration, so the function of the normalizer can be customized. There are also three options for the core normalization algorithm:

Normalizer class	Description	Example transforms (default configuration)
`AlphanumericPolyStringNormalizer` (default)	Keeps only (Latin) alphanumeric characters. Due to NFKD decomposition, composed national characters are converted to their base Latin characters.	`Gulôčka #2 v jamôčke` → `gulocka 2 v jamocke`; `Coup d’état!` → `coup detat`; `Сою́з 7к` → `7`
`Ascii7PolyStringNormalizer`	Keeps only (printable) ASCII7 characters, i.e. characters with Unicode code points U+0020 through U+007f.	`Gulôčka #2 v jamôčke` → `gulocka #2 v jamocke`; `Coup d’état!` → `coup d’etat!`; `Сою́з 7к` → `7`
`PassThroughPolyStringNormalizer`	Keeps all characters (but they are still subject to the other processing steps described above).	`Gulôčka #2 v jamôčke` → `gulôčka #2 v jamôčke`; `Coup d’état!` → `coup d’état!`; `Сою́з 7к` → `сою́з 7к` (composite characters such as `ô` or `é` are decomposed)
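
The difference between the three core algorithms can be tried out with the following Python sketch. It approximates the descriptions in the table and is not the midPoint code: the function names are hypothetical, and the surrounding steps are reduced to a compact `normalize` helper that assumes the whitespace step also strips leftover leading and trailing spaces. The sample uses a plain ASCII apostrophe, because a typographic apostrophe (U+2019) lies outside the U+0020 to U+007f range and would be dropped by this ASCII7 sketch.

```python
import re
import unicodedata


def alphanumeric_core(text: str) -> str:
    """Keep Latin letters, digits and whitespace only."""
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)


def ascii7_core(text: str) -> str:
    """Keep characters with code points U+0020 through U+007F only."""
    return "".join(ch for ch in text if 0x20 <= ord(ch) <= 0x7F)


def pass_through_core(text: str) -> str:
    """Keep every character."""
    return text


def normalize(text: str, core) -> str:
    """Compact version of the default steps around a pluggable core step."""
    text = unicodedata.normalize("NFKD", text.strip())
    text = core(text)
    text = re.sub(r"\s+", " ", text).strip()  # assumption: also strips leftover edges
    return text.lower()


for sample in ("Gulôčka #2 v jamôčke", "Coup d'état!", "Сою́з 7к"):
    for core in (alphanumeric_core, ascii7_core, pass_through_core):
        print(f"{core.__name__:18} {sample!r} -> {normalize(sample, core)!r}")
```

Note how the Cyrillic sample collapses to `7` under the first two algorithms. For data that is mostly non-Latin, those two algorithms would make many distinct values collide on the same norm form.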

Normalizer Configuration

Normalizers can be configured in the system configuration object:

<systemConfiguration oid="00000000-0000-0000-0000-000000000001">
    ...
    <internals>
        <polyStringNormalizer>
            <className>Ascii7PolyStringNormalizer</className>
            <nfkd>false</nfkd>
        </polyStringNormalizer>
    </internals>
    ...
</systemConfiguration>

Individual processing steps can be turned off by setting the corresponding elements (trim, nfkd, trimWhitespace, lowercase) to false. The normalizer is selected by placing its class name in the className element. The className element may also contain the fully-qualified class name of a custom normalizer (note: this functionality is EXPERIMENTAL).

Normalizers are initialized at system startup. The mechanism for handling normalizer configuration changes at runtime is very limited, so all midPoint nodes must be restarted whenever the normalizer configuration is changed. Also, a normalizer reconfiguration affects only values that are written after the configuration change. Existing values in the repository are unaffected. Therefore, for the change to take full effect, all the data needs to be updated (e.g. by exporting and re-importing the data).
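
The effect of stale norm values can be illustrated with a short Python sketch. Everything here is hypothetical (the dictionary-based `repository`, the two simplified normalizers, `find` and `renormalize`). In midPoint itself the values are refreshed by updating the data, for example through export and re-import, not by a function like this.

```python
import re
import unicodedata


def _finish(text: str) -> str:
    """Shared tail of both simplified normalizers: collapse whitespace, lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def normalize_alphanumeric(text: str) -> str:
    text = unicodedata.normalize("NFKD", text.strip())
    return _finish(re.sub(r"[^a-zA-Z0-9\s]", "", text))


def normalize_ascii7(text: str) -> str:
    text = unicodedata.normalize("NFKD", text.strip())
    return _finish("".join(ch for ch in text if 0x20 <= ord(ch) <= 0x7F))


def find(repo, query, normalizer):
    """Search with the currently configured normalizer."""
    wanted = normalizer(query)
    return [entry["orig"] for entry in repo if entry["norm"] == wanted]


def renormalize(repo, normalizer):
    """Recompute every norm value from orig, as a full data update would."""
    return [{"orig": e["orig"], "norm": normalizer(e["orig"])} for e in repo]


# The value was stored while the alphanumeric normalizer was configured.
repository = [{"orig": "Gulôčka #2", "norm": normalize_alphanumeric("Gulôčka #2")}]

# After switching to the ASCII7 normalizer the stored norm no longer matches.
print(find(repository, "Gulôčka #2", normalize_ascii7))

# Once the data has been rewritten, the same search succeeds again.
repository = renormalize(repository, normalize_ascii7)
print(find(repository, "Gulôčka #2", normalize_ascii7))
```

The first search comes back empty even though the query is identical to the stored orig value, which is the kind of surprise the full data update is meant to prevent.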
