<task oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9"
xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
xmlns:ext="http://midpoint.evolveum.com/xml/ns/public/model/extension-3"
xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">
<name>Import: retry errors</name>
<extension>
<ext:kind>account</ext:kind>
<ext:intent>default</ext:intent>
<ext:objectclass>ri:AccountObjectClass</ext:objectclass>
<ext:failedObjectsSelector>
<taskRef oid="e06f3f5c-4acc-4c6a-baa3-5c7a954ce4e9" />
<timeFrom>2021-02-18T15:00:00.342+01:00</timeFrom>
</ext:failedObjectsSelector>
</extension>
<ownerRef oid="00000000-0000-0000-0000-000000000002"/>
<executionStatus>runnable</executionStatus>
<handlerUri>http://midpoint.evolveum.com/xml/ns/public/model/synchronization/task/import/handler-3</handlerUri>
<objectRef oid="a1c7dcb8-07f8-4626-bea7-f10d9df7ec9f" type="ResourceType"/>
<recurrence>single</recurrence>
</task>
Task Error Handling
Since 4.3
This functionality is available since version 4.3.
|
Overview
During a task execution, failures can occur.[1] Some of them affect the whole task. A typical example is misconfigured or unreachable resource. These failures usually cause the task either to suspend, or to repeat its execution at defined time.
However, there can be also failures that are limited to individual objects. They are often caused by issues during provisioning of changes to a target resource. There can also be failures related to mapping evaluation, for example because of a programming error. Yet another class of problems are those related directly to the source objects, e.g. if there are major data quality issues that prevent these objects be processed in any reasonable way, or even technical issues preventing them from being fetched from the resource in the first place.
This document deals with the latter ones, i.e. failures related to processing of individual objects. So we will consider a situation that during task execution a subset of the objects fails to be processed, while the rest proceeds successfully. The ratio of failed objects can be negligible (like less than one in a thousand), moderate, or they may even represent the majority of all objects.
By "error handling" we mean here the mechanisms provided by midPoint to help the administrator to handle this kind of failures (errors).
The State Before
In midPoint 4.2 and before, the error handling in tasks was quite rudimentary.
For common tasks like reconciliation, recomputation, bulk actions execution, or import from resource, the only option how to handle failures that had occurred was to re-run the task. That way, all the objects were re-processed: those that had been processed successfully, were (most often) processed successfully again, and those that had failed, were (hopefully) processed correctly this time. The overhead of this solution depended on the ratio of failed objects. For very large sets of all objects (say millions) it could be considered wasteful to repeat the whole processing for relative small number of failed ones.[2]
A special category of tasks is live synchronization. Here midPoint provided two ways of handling errors:
-
stop processing when an error is encountered, until the error is not fixed;
-
ignore any errors and just continue processing.
The former option is safe, but can result in unnecessary delays in processing, mainly if errors occur relatively often. The latter eliminates delays, but results in missing updates and therefore resource vs. midPoint state inconsistency.
Error Handling in 4.3
As part of the midScale project, we have experimentally implemented two distinct task error handling mechanisms.
Operation Execution Records
When an object is processed by a task, an operation execution record is attached to the object.[3] That record carries an information whether the processing was successful or not. What is crucial for error handling is that these records allow us to easily select failed objects for re-processing, without the need to go through all the objects.
The practical use of this feature looks like this:
-
A main task
M
is run, processing a set of objects. Some of these objects encountered errors. Respective operation execution records are created for them. -
Then (when system administrator decides) another task is run, aimed at these erroneous objects. Let’s call this task the recoverer (
R
). It has the following characteristics:-
It usually has the same type as
M
. For example, ifM
is an import task, thenR
is usually import task as well. Other significant parameters, like specific bulk action to execute, should also match.[4] -
It operates on the same set of objects (specified e.g. by resource reference, object class, kind, intent, and/or a query) but with so-called failed objects selector added. This selector specifies e.g. result states that should be matched (e.g. fatal error, partial error, warning), reference to the main task(s)[5], or the time interval when the error occurred.
-
-
The recoverer then goes through failed objects, according to the original set specification combined with failed object selector, and tries to process them. The errors occurring in this task can be later handled again.
For midPoint 4.4 please have look at failed objects selector in object set specification. |
Triggers
Another option is to automatically schedule any failed object for re-processing using triggers. This mechanism is currently limited to synchronization tasks (import, reconciliation, live synchronization) and works like this:
-
An error is encountered during processing of a resource object shadow in a task.
-
If appropriate configuration is set, a trigger is created on the respective resource object shadow, reminding midPoint that the shadow should be synchronized again. The time interval for the trigger is configurable.
-
After specified time arrives, the
Trigger scanner
task retrieves the shadow and ensures that it is re-synchronized. -
If the repeated processing is successful, the process ends here. If not, another trigger (with an interval that may be the same or different) is set up, and the process repeats.
-
If the process is not successful even after specified number of repetitions, the process ends.
Which Approach to Use
Each of the options described has its own strengths and limitations. These are summarized in the table below.
Feature | Operation Execution Records | Triggers | Comment |
---|---|---|---|
Applicability |
Any kind of object processed by (almost) any task. |
Shadows, processed by synchronization tasks. |
|
Extra configuration required |
Yes. A recoverer task should (usually) be set up, including careful specification of failed objects selector. |
No. Trigger scanner takes care of everything. Only the retry strategy has to be set up in the main task. |
TODO any other differences?
Configuration Samples and Reference
Operation Execution Records
An example of a recoverer task:
The failedObjectSelector
can have the following items:
Item | Description | Default |
---|---|---|
|
What operation result statuses to select. |
|
|
What task(s) to look for when checking operation execution records? |
The current task. |
|
What is the earliest time of the record to be considered? This is important because the old execution records are not deleted automatically when an object is re-processed, unless one of the following occurs: either the recoverer task is the same as the main task (then the result is replaced by the new one), or a defined limit for operation execution records is reached. Then the oldest ones are purged. Therefore, one has to set up this information carefully to avoid repeated processing of already processed objects. |
No limit. |
|
What is the latest time of the record to be considered? |
If explicit task is not specified, then it is the last start timestamp of the current task’s root. If the task is different, then there is no limit there by default. |
|
How are failed objects selected. This is to overcome some technological obstacles in object searching in the provisioning module. Normally, there is no need to override the default value. |
|
The selection method has the following values:
Item | Description |
---|---|
|
When searching for shadows via provisioning, |
|
Simply narrow the original query by adding failed objects filter. It works with repository but usually not with provisioning. |
|
Failed objects are selected using the repository. Only after that, they are fetched one-by-one via provisioning and processed. This is preferable when there is only a small percentage of failed records. |
|
Uses original query to retrieve objects from a resource. Filtering is done afterwards, i.e. before results are passed to the processing. This is preferable when there is large percentage of failed records. |
Triggers
An example of configuration of error handling strategy using triggers:
<task oid="2d7f0709-3e9b-4b92-891f-c5e1428b6458"
xmlns="http://midpoint.evolveum.com/xml/ns/public/common/common-3"
xmlns:ri="http://midpoint.evolveum.com/xml/ns/public/resource/instance-3">
<name>Live Sync</name>
<ownerRef oid="00000000-0000-0000-0000-000000000002"/>
<executionState>runnable</executionState>
<activity>
<work>
<liveSynchronization>
<resourceObjects>
<resourceRef oid="a20bb7b7-c5e9-4bbb-94e0-79e7866362e6" />
<objectclass>ri:AccountObjectClass</objectclass>
</resourceObjects>
</liveSynchronization>
</work>
<controlFlow>
<errorHandling>
<entry>
<situation>
<errorCategory>generic</errorCategory>
</situation>
<reaction>
<retryLater>
<initialInterval>PT30M</initialInterval>
<nextInterval>PT1H</nextInterval>
<retryLimit>3</retryLimit>
</retryLater>
</reaction>
</entry>
<entry>
<situation>
<errorCategory>configuration</errorCategory>
<status>fatal_error</status>
</situation>
<reaction>
<retryLater>
<initialInterval>PT1D</initialInterval>
<nextInterval>PT3D</nextInterval>
<!-- no retry limit -->
</retryLater>
</reaction>
</entry>
</errorHandling>
</controlFlow>
</activity>
</task>
In this sample, after a generic error is encountered, the retry is attempted after 30 minutes. The next retries
are done after 1 hour. The process stops after 4 attempts. However, if the error was configuration-related
(with the status of FATAL_ERROR
), then the initial interval is 1 day, with retries after 3 days,
and without attempt limit.
Generally, the errorHandlingStrategy
contains a list of entries. Each entry has:
Item | Description | Default |
---|---|---|
|
Order in which this entry is to be evaluated. (Related to other entries.) Smaller numbers go first. Entries with no order go last. |
No order. |
|
A situation that can occur. |
Any error. (Not same as errorCategory = generic) |
|
What should a task do when a given situation is encountered? |
|
A situation
contains the following:
Item | Description | Default |
---|---|---|
|
Operation result status to match. Can be either PARTIAL_ERROR or FATAL_ERROR. |
If not present, we decide solely on error category. If error categories are not specified, any error matches. |
|
Error category (network, security, policy, …) to match. Note that some errors are not propagated to the level where they can be recognized by this selector. So be careful and consider this feature to be highly experimental. |
If not present, we decide solely on the status. If status is not present, any error matches. |
The reaction
is either:
Reaction | Description | Note |
---|---|---|
|
The processing should continue, ignoring the error. E.g. for live sync tasks, this means that the sync token is advanced to the next item, effectively marking the record as processed. |
This is the default strategy for the majority of tasks. |
|
The processing is stopped. |
This is the default strategy for live sync and async update tasks. |
|
Processing of the specified account should be retried later using a trigger, as was described. |
This strategy has more parameters, see below. |
Notes:
-
Names for these options may be changed in the future, to make them more compatible with error handling based on operation execution records. (They were created before, and not revised afterwards.)
-
Operation execution recording is not influenced by these settings. So each error is recorded regardless of the value of
reaction
. This is why operation execution records based error handling works well with the default setting ofignore
reaction (although by "ignoring" one can imagine that the error is not even recorded). -
Besides these options, you can specify also
stopAfter
property (applicable toignore
andretryLater
reactions) that cause the task to be stopped after seeing specified number of error situations.
The retryLater
reaction has itself the following properties:
Property | Meaning | The default |
---|---|---|
|
Initial retry interval. |
30 minutes |
|
Next retry interval, after initial attempt. |
6 hours |
|
Maximal number of retries to attempt. |
unlimited |
To conclude, the mechanisms described here are all experimental. They will be fine-tuned based on users' experiences and feedback. |
FATAL_ERROR
or PARTIAL_ERROR
.