Task Error Handling

Last modified 15 Apr 2021 20:49 +02:00
EXPERIMENTAL
This feature is experimental. It means that it is not intended for production use. The feature is not finished. It is not stable. The implementation may contain bugs, the configuration may change at any moment without any warning and it may not work at all. Use at your own risk. This feature is not covered by midPoint support. In case that you are interested in supporting development of this feature, please consider purchasing midPoint Platform subscription.
Since 4.3
This functionality is available since version 4.3.

Overview

During a task execution, failures can occur.[1] Some of them affect the whole task. A typical example is misconfigured or unreachable resource. These failures usually cause the task either to suspend, or to repeat its execution at defined time.

However, there can be also failures that are limited to individual objects. They are often caused by issues during provisioning of changes to a target resource. There can also be failures related to mapping evaluation, for example because of a programming error. Yet another class of problems are those related directly to the source objects, e.g. if there are major data quality issues that prevent these objects be processed in any reasonable way, or even technical issues preventing them from being fetched from the resource in the first place.

This document deals with the latter ones, i.e. failures related to processing of individual objects. So we will consider a situation that during task execution a subset of the objects fails to be processed, while the rest proceeds successfully. The ratio of failed objects can be negligible (like less than one in a thousand), moderate, or they may even represent the majority of all objects.

By "error handling" we mean here the mechanisms provided by midPoint to help the administrator to handle this kind of failures (errors).

The State Before

In midPoint 4.2 and before, the error handling in tasks was quite rudimentary.

For common tasks like reconciliation, recomputation, bulk actions execution, or import from resource, the only option how to handle failures that had occurred was to re-run the task. That way, all the objects were re-processed: those that had been processed successfully, were (most often) processed successfully again, and those that had failed, were (hopefully) processed correctly this time. The overhead of this solution depended on the ratio of failed objects. For very large sets of all objects (say millions) it could be considered wasteful to repeat the whole processing for relative small number of failed ones.[2]

A special category of tasks is live synchronization. Here midPoint provided two ways of handling errors:

  1. stop processing when an error is encountered, until the error is not fixed;

  2. ignore any errors and just continue processing.

The former option is safe, but can result in unnecessary delays in processing, mainly if errors occur relatively often. The latter eliminates delays, but results in missing updates and therefore resource vs. midPoint state inconsistency.

Error Handling in 4.3

As part of the midScale project, we have experimentally implemented two distinct task error handling mechanisms.

Operation Execution Records

When an object is processed by a task, an operation execution record is attached to the object.[3] That record carries an information whether the processing was successful or not. What is crucial for error handling is that these records allow us to easily select failed objects for re-processing, without the need to go through all the objects.

The practical use of this feature looks like this:

  1. A main task M is run, processing a set of objects. Some of these objects encountered errors. Respective operation execution records are created for them.

  2. Then (when system administrator decides) another task is run, aimed at these erroneous objects. Let’s call this task the recoverer (R). It has the following characteristics:

    • It usually has the same type as M. For example, if M is an import task, then R is usually import task as well. Other significant parameters, like specific bulk action to execute, should also match.[4]

    • It operates on the same set of objects (specified e.g. by resource reference, object class, kind, intent, and/or a query) but with so-called failed objects selector added. This selector specifies e.g. result states that should be matched (e.g. fatal error, partial error, warning), reference to the main task(s)[5], or the time interval when the error occurred.

  3. The recoverer then goes through failed objects, according to the original set specification combined with failed object selector, and tries to process them. The errors occurring in this task can be later handled again.

Triggers

Another option is to automatically schedule any failed object for re-processing using triggers. This mechanism is currently limited to synchronization tasks (import, reconciliation, live synchronization) and works like this:

  1. An error is encountered during processing of a resource object shadow in a task.

  2. If appropriate configuration is set, a trigger is created on the respective resource object shadow, reminding midPoint that the shadow should be synchronized again. The time interval for the trigger is configurable.

  3. After specified time arrives, the Trigger scanner task retrieves the shadow and ensures that it is re-synchronized.

  4. If the repeated processing is successful, the process ends here. If not, another trigger (with an interval that may be the same or different) is set up, and the process repeats.

  5. If the process is not successful even after specified number of repetitions, the process ends.

Which Approach to Use

Each of the options described has its own strengths and limitations. These are summarized in the table below.

Feature Operation Execution Records Triggers Comment

Applicability

Any kind of object processed by (almost) any task.

Shadows, processed by synchronization tasks.

Extra configuration required

Yes. A recoverer task should (usually) be set up, including careful specification of failed objects selector.

No. Trigger scanner takes care of everything. Only the retry strategy has to be set up in the main task.

TODO any other differences?

Configuration Samples and Reference

Operation Execution Records

An example of a recoverer task:

<task oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9"
    xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
    xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
    xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">

    <name>Import: retry errors</name>

    <extension>
        <ext:kind>account</ext:kind>
        <ext:intent>default</ext:intent>
        <ext:objectclass>ri:AccountObjectClass</ext:objectclass>
        <ext:failedObjectsSelector>
            <taskRef oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9" />
            <timeFrom>2021-02-18T15:00:00.342+01:00</timeFrom>
        </ext:failedObjectsSelector>
    </extension>

    <ownerRef oid="00000000-0000-0000-0000-000000000002"/>
    <executionStatus>runnable</executionStatus>

    <handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/import/handler-3</handlerUri>
    <objectRef oid="a1c7dcb8-07f8-4626-bea7-f10d9df7ec9f" type="ResourceType"/>
    <recurrence>single</recurrence>
</task>

The failedObjectSelector can have the following items:

Item Description Default

status

What operation result statuses to select.

FATAL_ERROR and PARTIAL_ERROR

taskRef

What task(s) to look for when checking operation execution records?

The current task.

timeFrom

What is the earliest time of the record to be considered? This is important because the old execution records are not deleted automatically when an object is re-processed, unless one of the following occurs: either the recoverer task is the same as the main task (then the result is replaced by the new one), or a defined limit for operation execution records is reached. Then the oldest ones are purged.

Therefore, one has to set up this information carefully to avoid repeated processing of already processed objects.

No limit.

timeTo

What is the latest time of the record to be considered?

If explicit task is not specified, then it is the last start timestamp of the current task’s root. If the task is different, then there is no limit there by default.

selectionMethod

How are failed objects selected. This is to overcome some technological obstacles in object searching in the provisioning module. Normally, there is no need to override the default value.

default

The selection method has the following values:

Item Description

default

When searching for shadows via provisioning, fetchFailedObjects; otherwise narrowQuery.

narrowQuery

Simply narrow the original query by adding failed objects filter. It works with repository but usually not with provisioning.

fetchFailedObjects

Failed objects are selected using the repository. Only after that, they are fetched one-by-one via provisioning and processed. This is preferable when there is only a small percentage of failed records.

filterAfterRetrieval

Uses original query to retrieve objects from a resource. Filtering is done afterwards, i.e. before results are passed to the processing. This is preferable when there is large percentage of failed records.

Triggers

An example of configuration of error handling strategy using triggers:

<task oid="2d7f0709-3e9b-4b92-891f-c5e1428b6458"
    xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
    xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
    xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">

    <name>Live Sync</name>

    <extension>
        <ext:objectclass>ri:AccountObjectClass</ext:objectclass>
    </extension>

    <ownerRef oid="00000000-0000-0000-0000-000000000002"/>
    <executionStatus>runnable</executionStatus>

    <handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/live-sync/handler-3</handlerUri>
    <objectRef oid="a20bb7b7-c5e9-4bbb-94e0-79e7866362e6" type="ResourceType"/>
    <recurrence>single</recurrence>

    <errorHandlingStrategy>
        <entry>
            <situation>
                <errorCategory>generic</errorCategory>
            </situation>
            <reaction>
                <retryLater>
                    <initialInterval>PT30M</initialInterval>
                    <nextInterval>PT1H</nextInterval>
                    <retryLimit>3</retryLimit>
                </retryLater>
            </reaction>
        </entry>
        <entry>
            <situation>
                <errorCategory>configuration</errorCategory>
                <status>fatal_error</status>
            </situation>
            <reaction>
                <retryLater>
                    <initialInterval>P1D</initialInterval>
                    <nextInterval>P3D</nextInterval>
                    <!-- no retry limit -->
                </retryLater>
            </reaction>
        </entry>
    </errorHandlingStrategy>
</task>

In this sample, after a generic error is encountered, the retry is attempted after 30 minutes. The next retries are done after 1 hour. The process stops after 4 attempts. However, if the error was configuration-related (with the status of FATAL_ERROR), then the initial interval is 3 days, with retries after 3 days, and without attempt limit.

Generally, the errorHandlingStrategy contains a list of entries. Each entry has:

Item Description Default

order

Order in which this entry is to be evaluated. (Related to other entries.) Smaller numbers go first. Entries with no order go last.

No order.

situation

A situation that can occur.

Any error.

reaction

What should a task do when a given situation is encountered?

ignore or stop (see below)

A situation contains the following:

Item Description Default

status

Operation result status to match. Can be either PARTIAL_ERROR or FATAL_ERROR.

If not present, we decide solely on error category. If error categories are not specified, any error matches.

errorCategory

Error category (network, security, policy, …​) to match. Note that some errors are not propagated to the level where they can be recognized by this selector. So be careful and consider this feature to be highly experimental.

If not present, we decide solely on the status. If status is not present, any error matches.

The reaction is either:

Reaction Description Note

ignore

The processing should continue, ignoring the error. E.g. for live sync tasks, this means that the sync token is advanced to the next item, effectively marking the record as processed.

This is the default strategy for the majority of tasks.

stop

The processing is stopped.

This is the default strategy for live sync and async update tasks.

retryLater

Processing of the specified account should be retried later using a trigger, as was described.

This strategy has more parameters, see below.

Notes:

  1. Names for these options may be changed in the future, to make them more compatible with error handling based on operation execution records. (They were created before, and not revised afterwards.)

  2. Operation execution recording is not influenced by these settings. So each error is recorded regardless of the value of reaction. This is why operation execution records based error handling works well with the default setting of ignore reaction (although by "ignoring" one can imagine that the error is not even recorded).

  3. Besides these options, you can specify also stopAfter property (applicable to ignore and retryLater reactions) that cause the task to be stopped after seeing specified number of error situations.

The retryLater reaction has itself the following properties:

Property Meaning The default

initialInterval

Initial retry interval.

30 minutes

nextInterval

Next retry interval, after initial attempt.

6 hours

retryLimit

Maximal number of retries to attempt.

unlimited

To conclude, the mechanisms described here are all experimental. They will be fine-tuned based on users' experiences and feedback.


1. Although the words "error" and "failure" have their precise meaning in the engineering context, we will use them interchangeably and somehow freely. In this document they denote any midPoint-detected problem in processing, represented by appropriate operation status: either FATAL_ERROR or PARTIAL_ERROR.
2. Administrators often had to resort to clever hacks, like trying to identify patterns of failures, and then formulating that patterns as object filters that were used in repeated task runs. However, this was generally tedious and applicable only in some situations.
3. Actually, there are two kinds of operation execution records: operation-level records (sometimes called "complex") and modification execution records (sometimes called "simple"). We now talk about the former ones. In midPoint 4.2 and before, we did not explicitly differentiate between these two, and the support for operation-level records was incomplete.
4. This is not a strict rule. There can be situations when, for example, the main task is a bulk action task, and the recoverer is recomputation task. Or the recoverer can use a different bulk action than was used in the main task, if needed.
5. A single recoverer can treat multiple main tasks. Also, a recoverer can be the same task as the main one, with just the selector added.