Metadata Usecases and Structure Design

Last modified 12 Mar 2021 10:22 +01:00

Those are the notes about the expected use of meta-data in midPoint.

  • provenance metadata : High-level information about data origin. Audience of this information are end users and business people. Therefore the information has to be understandable to them. This is going to be a multi-value structure of data origins. Data are combined in midPoint mappings, therefore each value can have several ultimate origins.

    • Reference to "business" representation of data origin. This may be organiation, organizational unit, service or anything else that makes a business sense.

    • Description of data acquisition event. This is the moment that midPoint learned about the data. It may not completely describe ultimite origin of the data, as the system that provides the data to midPoint may have received the data from another system. But it describes the origin in the best way that midPoint can know.

      • Data acquisition timestamp.

      • Identification of an acquisition mechanism that was used to obtain the data. This may be user entry, synchronization from resource, data set by REST API or it may be legacy (i.e. data that originated before we started to record the provenance) or unknown. This is different than channel. Channel is technical means how the data entered midPoint server, however acquisition mechanism has a business meaning. REST API may be in fact a proxy for other acquisition mechanisms, e.g. user entry by using a custom GUI. Also, we should have an option to accept data with metadata on REST interface. In that case channel will indicate REST, but acquisition mechanism will be set to user entry. However, this should be somehow limited, e.g. by use of authorizations or special REST configuration.

      • Identification of an actor (user or service) that was origin of the data. E.g. identification of a user that was logged in to GUI when the data were entered. Or identification of a service that initiated the REST service that set the data.

      • Reference to the resource that was used to acquire the data (if applicable).

    • In addition to acquisition we may be interested in the ultimate origin of the data. However, this part of the provenance metadata is not yet clear. It is related to data portability. This is partially defined by originRef, but not entirely. We would probably be interested in more data than originRef can provide. We do not define the details of ultimate origin for now - instead of speculating and creating a data structure that is very likely to change in the future. But once we find a way how to define it, this will be the right place to record it.

  • storage metadata : the details when data were stored in midPoint repository. See MetadataType in common-3 schema.

    • create timestamp, orginator user, channel, taskRef

    • last modification timestamp, orginator user, channel, taskRef

  • process metadata : the details how the data were processed by midPoint processes/tasks. Only present if it was approved or otherwise processed by a "process". See MetadataType in common-3 schema.

    • request timestamp, originator user, comment, channel

    • last change (create, modify) timestamp, approver, comment

    • last certification timestamp, outcome, certifier, comment

  • provisioning metadata

    • last provisioning timestamp

    • later: maybe some complex data structure that will describe how data were shared to resources. So far we have Shadow to hold that information. But shadows may disappear. Maybe we may want to have another record for that. Or maybe the solution will be to keep "dead" shadows for a long time.

  • Transformation metadata : keeping a detailed track of how the data were processed. The aim of this information is to provide very detailed informtion about data origin and transformation, mostly for diagnostics and troubleshooting. The audience is mostly system administrators, but system auditors may be also interested. This kind of metadata is likely to be transient.

    • data sources

    • mappings that processed the data

  • misc internal midPoint metadata, probably need to better charaterized and sorted:

    • Suppression reason for potential values. E.g. if there is a potential (non-existent) data value, we want to know that it was suppressed because a value override is in place. I.e. mapping did not set the value because a different value already exists. Similar case is there for weak mappings, mappings with conditions and so on.

  • assurance metadata: Data about reliability of the information. We will not be formalizing those in midPoint 4.2. The concepts and schemas are still evolving. We will leave that for metadata extensions, for every deployment to customize. If any of the customization schema should appear quite frequently, we can "standardize" that later.

    • LoA

    • LoA source/reason (for debugging and traceability)

    • verification data (e.g. signature)

  • Policy-related metadata (needs better name)

    • reason to use/store the data, consent

    • policy violation / remediation informations (timestamps, which policy, …​)

  • Conditions & limitations metadata (needs better name)

    • validity (timestamp)

    • constratins (e.g. assignment is valid as long as the user is employee)