Distribution


This page describes how work distribution is defined for activities.

Not all of these options are supported by all activities! Worker threads are supported only by activities marked as multi-thread-capable. Buckets, worker tasks, and auto-scaling are supported only by activities marked as multi-node-capable. The list of activities with these attributes is here.

The following configuration options are available:

buckets

    Division of the item set into separately processable units, so-called buckets, allowing multiple tasks (so-called worker tasks) to process them concurrently. Supported by activities marked as multi-node-capable.

    Default: all items in a single bucket.

workers

    Definition of how worker tasks should be created. Supported by activities marked as multi-node-capable.

    Default: all items are processed by a single task.

workerThreads

    After obtaining items to be processed, each (worker) task can distribute them to multiple worker threads for processing. This property defines their number. Supported by activities marked as multi-thread-capable.

    Default: all items are processed by a single thread.

subtask

    If present, the activity is executed in a specially-created subtask devoted to its execution. This is not normally needed, except for these reasons:

      1. Visibility:
        there are task-specific counters (e.g. related to environmental performance or internal midPoint performance) that may be useful to view separately for a given activity.

      2. Concurrent execution:
        if multiple activities need to run concurrently, they should be placed into distinct tasks.

    Default: no subtask for the activity.

subtasks

    Applicable to composite activities only. If present, all the sub-activities are executed in separate subtasks.

    Default: no subtasks for the sub-activities.

autoscaling

    Whether work distribution for this activity should be auto-scaled.

    Default: no auto-scaling.
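
For illustration, here is a minimal sketch of a distribution configuration combining two of these options: it runs the activity in a dedicated subtask and processes items with four worker threads. The element names follow the table above; the empty subtask element and the thread count of 4 are illustrative assumptions, not prescribed values.

Worker threads and a dedicated subtask
<distribution>
    <subtask/>                       <!-- run the activity in its own subtask -->
    <workerThreads>4</workerThreads> <!-- process items using 4 worker threads -->
</distribution>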

Buckets

The work is divided into buckets (also called work buckets): abstract chunks of work.

Usually the work consists of iterating over a set of objects: either stored in the midPoint repository (e.g. for a recomputation task) or stored on a resource (e.g. import, reconciliation, or live synchronization). The most natural way to segment the work into buckets is therefore to define a bucket as the set of objects for which a particular item - let us call it the discriminator - has a value in a given interval. The interval can be numeric, alphanumeric, or of any other comparable type (e.g. timestamps). For repository objects, OIDs can be used for segmentation as well.

There are other possibilities as well. For example, one could segment users according to archetype, organization membership, and so on. Work buckets can be defined using arbitrary search filters over the set of objects.

Basic bucket state

A major distinction for a work bucket is whether it is complete or not. A work bucket is declared complete if there is no more work that can be done on it. This does not mean that all the objects were successfully processed, though: some of them might have failed. However, this is considered a normal situation, and such objects are treated as processed. (Re-processing of such objects can be implemented in the future, if needed.) The other state is ready, meaning that there is some part of the bucket (maybe all of it) that still needs to be processed.

Looking at the buckets' state allows us to track progress at a coarse-grained level and to restart the work from the last known point if necessary.

Multi-node work distribution

Buckets allow us not only to track progress, but also to easily distribute the work among multiple worker tasks, with the intention of distributing those tasks among cluster nodes.

In such a multi-node scenario, there is a coordinator task and a set of worker tasks. The coordinator holds the authoritative list of buckets to be processed. Each worker tries to grab one or more buckets to work on; such buckets are then allocated to that worker. After completing them, the worker tries to get another bucket or buckets to work on.
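
A sketch of such a setup, combining a bucket definition with a workers definition. The bucket definition is explained below; the workersPerNode/count shape of the workers definition is an assumption about the schema and may differ between midPoint versions, so treat this as illustrative only.

Buckets distributed to two worker tasks per node
<distribution>
    <buckets>
        <oidSegmentation/>       <!-- buckets based on OID characters, using the defaults -->
    </buckets>
    <workers>
        <workersPerNode>
            <count>2</count>     <!-- two worker tasks on each cluster node -->
        </workersPerNode>
    </workers>
</distribution>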

Bucket segmentation definition

The segmentation is defined like this:

An example of segmentation configuration
<task>
    ...
    <distribution>
        <buckets>
            <numericSegmentation>
                <discriminator>attributes/number</discriminator>
                <numberOfBuckets>100</numberOfBuckets>
                <to>100000</to>
            </numericSegmentation>
        </buckets>
    </distribution>
</task>

This is read as follows: the discriminator (the attributes/number attribute) is expected to have a numeric value from 0 to 99999 (inclusive), and we want to divide this range into 100 buckets. Each bucket therefore covers 100000 / 100 = 1000 values: the first one contains values from 0 to 999, the second one from 1000 to 1999, then 2000 to 2999, etc. The last (100th) bucket contains values from 99000 to 99999, inclusive.

Definition options

The current implementation supports the following segmentation definitions:

all definitions

    discriminator
        The item whose values will be used to segment objects into buckets (if applicable). Usually required.

    matchingRule
        The matching rule to be applied when creating filters (if applicable). Optional.

    numberOfBuckets
        The number of buckets to be created (if applicable). Optional.

numericSegmentation

    from
        Start of the processing space (inclusive). If omitted, 0 is assumed.

    to
        End of the processing space (exclusive). If not present, both bucketSize and numberOfBuckets must be defined, and the end of the processing space is determined as their product. In the future, dynamic determination of this value might be implemented, e.g. by counting the objects to be processed.

    bucketSize
        Size of one bucket. If not present, it is computed as the total processing space divided by the number of buckets (i.e. to and numberOfBuckets must be present).

stringSegmentation

    boundary/position
        Position(s) to which the boundary characters apply. Should be specified, because (1) the ordering of boundary specifications is undefined, and (2) multiple definitions of the same boundary characters are not possible. Positions start at 1.

    boundary/characters
        Characters that make up the boundaries. These characters must be sorted.
        Reserved characters: '-', '$' (to be implemented later).
        Escaping character: '\'.

    depth
        If a value N greater than 1 is specified here, the boundary values are repeated N times. This means that if values V1, V2, …, Vk are specified, the resulting sequence is V1, V2, …, Vk, V1, V2, …, Vk, etc., with N repetitions, so N × k values in total.

    comparisonMethod
          • interval (the default)
            results in interval queries like item >= 'a' and item < 'b'

          • prefix
            results in prefix queries like item starts with 'a'. *1

          • exactMatch
            uses exact value matching. *1

        *1 This is quite risky and should be used only when you are absolutely sure that the boundary values cover all possible values of the discriminator.

oidSegmentation

    The same as stringSegmentation, but providing defaults of discriminator = # and characters = 0-9a-f (repeated depth times, if needed).

explicitSegmentation

    content
        Explicit content of the work buckets to be used. This is useful e.g. when dealing with filter-based buckets, but any other bucket content (e.g. numeric intervals, string intervals, string prefixes) can be used here as well.

implicitSegmentation

    (no parameters; see "all definitions")
        Implicit content of work buckets for the given kind of activity is used.
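
To complement the earlier numeric example: when to is omitted, the end of the processing space is derived as bucketSize × numberOfBuckets. A sketch of that variant (1000 × 100 = 100000, so it is equivalent to the earlier configuration):

Numeric segmentation with an implicit end of the processing space
<distribution>
    <buckets>
        <numericSegmentation>
            <discriminator>attributes/number</discriminator>
            <bucketSize>1000</bucketSize>          <!-- each bucket covers 1000 values -->
            <numberOfBuckets>100</numberOfBuckets> <!-- end = 1000 × 100 = 100000 -->
        </numericSegmentation>
    </buckets>
</distribution>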

Additional configuration for buckets

allocation (experimental)

    bucketCreationBatch
        How many buckets are to be created at once.

    workAllocationInitialDelay
        Size of the random interval for the initial delay.

    workAllocationFreeBucketWaitInterval
        If specified, overrides the time used to wait for free bucket(s) reclamation. This is applied when no free buckets are available but the work is not completely done.

sampling (experimental)

    regular/interval
        Interval of buckets in the sample (i.e. N means that every N-th bucket is selected).

    regular/sampleSize
        Number of buckets in the sample. It is converted to an interval by dividing the total number of buckets (if known) by the sample size.

    random/probability
        Probability of including a bucket in the sample (a number between 0 and 1).

    random/sampleSize
        Approximate number of buckets in the sample. It is converted to a probability by dividing the sample size by the total number of buckets (if known).
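
As an illustration, a sampling definition that processes only every 10th bucket (e.g. to try out a configuration on a fraction of the data) might look like this. Placing sampling under buckets follows the structure described above, but since the feature is experimental, verify the exact schema against your midPoint version:

Processing every 10th bucket
<distribution>
    <buckets>
        <oidSegmentation>
            <depth>2</depth>            <!-- 16² = 256 buckets -->
        </oidSegmentation>
        <sampling>
            <regular>
                <interval>10</interval> <!-- select every 10th bucket -->
            </regular>
        </sampling>
    </buckets>
</distribution>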

More examples

The oidSegmentation is the easiest one to use when dealing with repository objects. The following creates 16² = 256 segments.

Buckets defined on the first two characters of the OID
<distribution>
    <buckets>
        <oidSegmentation>
            <depth>2</depth>
        </oidSegmentation>
    </buckets>
</distribution>

The following configuration provides string interval buckets on the first two characters of the name:

  • less than aa

  • greater or equal aa, less than ab

  • greater or equal ab, less than ac

  • …

  • greater or equal zy, less than zz

  • greater or equal zz

(The comparison is done on the normalized form of the name attribute, per the polyStringNorm matching rule.)

Buckets defined on the first two characters of the name
<distribution>
    <buckets>
        <stringSegmentation>
            <discriminator>name</discriminator>
            <matchingRule>polyStringNorm</matchingRule>
            <boundary>
                 <!-- buckets are like: (start) -> aa, aa -> ab, ab -> ac, ..., zy -> zz, zz -> (end) -->
                <position>1</position>
                <position>2</position>
                <characters>abcdefghijklmnopqrstuvwxyz</characters>
            </boundary>
            <comparisonMethod>interval</comparisonMethod>
        </stringSegmentation>
    </buckets>
</distribution>

The following configuration provides three buckets. The first comprises identifier values less than 123. The second comprises values from 123 (inclusive) to 200 (exclusive). And the last one contains values greater than or equal to 200.

Three work buckets defined as numeric intervals
<distribution>
    <buckets>
        <explicitSegmentation>
            <discriminator>attributes/ri:identifier</discriminator>
            <content xsi:type="NumericIntervalWorkBucketContentType">
               <to>123</to>
            </content>
            <content xsi:type="NumericIntervalWorkBucketContentType">
               <from>123</from>
               <to>200</to>
            </content>
            <content xsi:type="NumericIntervalWorkBucketContentType">
               <from>200</from>
            </content>
        </explicitSegmentation>
    </buckets>
</distribution>

The following configuration provides four buckets. The first three correspond to users with an employeeType of teacher, student, and administrative, respectively. The last one corresponds to users with no employeeType set.

Work buckets defined on employeeType values
<distribution>
    <buckets>
        <explicitSegmentation>
            <content xsi:type="FilterWorkBucketContentType">
                <q:filter>
                    <q:text>employeeType = "teacher"</q:text>
                </q:filter>
            </content>
            <content xsi:type="FilterWorkBucketContentType">
                <q:filter>
                    <q:text>employeeType = "student"</q:text>
                </q:filter>
            </content>
            <content xsi:type="FilterWorkBucketContentType">
                <q:filter>
                    <q:text>employeeType = "administrative"</q:text>
                </q:filter>
            </content>
            <content xsi:type="FilterWorkBucketContentType">
                <q:filter>
                    <q:text>employeeType not exists</q:text>
                </q:filter>
            </content>
        </explicitSegmentation>
    </buckets>
</distribution>

Auto-scaling

The auto-scaling configuration is very simple:

enabled

    Is auto-scaling enabled for this activity?

    Default: true if the autoscaling configuration exists, false otherwise. (Note that this may change in the future.)
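
A minimal sketch follows. Per the table above, the mere presence of the autoscaling element already implies enabled = true, so the explicit flag is shown only for clarity:

Enabling auto-scaling for an activity
<distribution>
    <autoscaling>
        <enabled>true</enabled> <!-- redundant: presence of autoscaling already enables it -->
    </autoscaling>
</distribution>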
