<systemConfiguration oid="00000000-0000-0000-0000-000000000001">
...
<internals>
<polyStringNormalizer>
<className>Ascii7PolyStringNormalizer</className>
<nfkd>false</nfkd>
</polyStringNormalizer>
</internals>
...
</systemConfiguration>
PolyString Normalization Configuration
Introduction
PolyString is a flexible data structure that has several purposes.
One of the purposes is to allow comfortable text search in international environment.
The basic idea is that is the user is searching for user naiveboy123
then the users naiveBOY123
and NaiveBoy123
are found.
The other effect is that if someone already registered naiveboy123
username then other user cannot register naiveBOY123
, NaiveBoy123
and NAIVEboy123
usernames.
This functionality is sometimes achieved by using a native full text search capabilities of the database. However, those database features are often difficult to configure, and they may also be somehow expensive. But most importantly of all they are specific to individual databases, there is no practical standard. As midPoint supports several databases it would be very difficult to support those capabilities in all the databases. Therefore midPoint is using a much simpler approach.
PolyString is storing the value in two different forms:
-
Original form (orig): the text that was entered by the user. This may contain international characters, any number of whitespace and so on. E.g.
Coup D’état
. -
Normalized form (norm): the text that was simplified, cleaned up, transformed to canonical form or otherwise prepared for storage. E.g.
coup detat
.
Both orig and norm forms are stored in the database. The orig form is used for the vast majority of purposes: displaying the value, editing the value and so on. However, when it comes to searching or uniqueness check then the norm value is used. Searching works like this:
-
Careless user enters
Coup d`etat
into search input field. -
MidPoint normalizes the value. The result is
coup detat
. -
MidPoint looks in the database for all entries that have norm value equals to
coup detat
. -
An entry is found. The orig part the matching entry is displayed:
Coup D’état
.
Of course, this capability is a bit limited. Searching for "coup d etat" is unlikely to provide any meaningful results. There is no equivalence nor stemming, therefore it makes no sense to search for "putsch". Yet, this is nice, simple and elegant method for many practical use cases.
Normalizers
The effectiveness of the method depends heavily on the normalization algorithm. Given the right algorithm and the system will work flawlessly. However, wrong normalization algorithm may cause a lot of problems. Therefore, the normalization algorithm is configurable and there is an option for a completely custom normalization algorithm.
All normalization algorithms bundled with midPoint go through the same set of steps:
-
Trimming (trim): removing whitespaces at the start and the end of the string.
-
Decomposition (nfkd): composed characters (such as é) are decomposed to the constituent parts (e and '). Unicode Normalization Form Compatibility Decomposition (NFKD) is used for this purpose.
-
Core normalization algorithm: e.g. removing all non-alphanumeric characters, removing all non-ASCII characters and so on.
-
Whitespace trimming (trimWhitespace): removing extra whitespaces and replacing with a single space. E.g. Good morning Sunshine is transformed to Good morning Sunshine.
-
Lowercase transform (lowercase): all characters are transformed to their lowercase equivalents.
All these steps are applied by default. But individual steps can be disabled in the normalizer configuration, therefore the function of the normalizer can be customized. There are also three options for the core normalization algorithm:
Normalizer class | Description | Example transforms (default configration) |
---|---|---|
|
Keeps only (latin) alphanumeric characters. |
|
|
Keeps only (printable) ACSII7 characters, i.e. characters with unicode codes U+0020 through U+007f. |
|
|
Keeps all characters (but still subject to other processing phases described above). |
|
Normalizer Configuration
Normalizers can be configured in system configuration object:
Individual processing steps can be turned off by setting corresponding elements (trim
, nfkd
, trimWhitespace
, lowercase
) to false
. Normalizer can be specified by placing its class name to a className
element.
The className
element may also contain fully-qualified class name of a custom normalizer code (Note: this functionality is EXPERIMENTAL).
Normalizers are initialized at system startup. The mechanism for handling change of normalizer configuration in runtime is very limited, therefore all midPoint nodes must be restarted if normalizer configuration is changed. Also, the normalizer reconfiguration affect only new values that are updated after configuration change. Existing values in the repository are unaffected. Therefore for the change to take a full effect all the data need to be updated (e.g. export and re-import of the data).