Use Cases for Data Provenance
Use cases and "user stories" for data provenance features. The purpose of this document is to guide design activities by providing a list all the practical use cases for inspiration. We do not claim that our implementation will support all of those use cases.
This list is not complete. It is expected that it will be continually updated.
Data Protection
Accountability
We would like to know where the particular value came from or where they have been provisioned to (which system, which "third-party"). This is needed for:
-
Internal processes and audits. We need to know whether we can process the data at all. E.g. we need to declare that this particular item was retrieved from public registry, which registry it was and when it was retrieved. We may need to declare that that particular value was entered by the user.
-
Processing of Subject Access Requests (SARs) We need to report data use to the user. We should be declaring origin of the data to the user.
-
De-provisioning. Account should be deprovisioned either in compliance of organization policy or on user request. For example when the user leaves the organization and wants to delete all his data from all linked systems.
Management of Sensitive Data
There are categories of sensitive data that need special protection. Those are often national identifiers of physical persons or their equivalents (such as SSN in the U.S.). But there are also other data, such as race, political and religious views and so on. We must make sure that those data are processed carefully and that they do not leak, not even in derived form. For example, we may want to flag all values of sensitive data. And we may want to setup mappings in the system in such a way that they propagate this flag whenever a derived values is computed. Therefore if we use sensitive identifier to derive a username, value of the username will also be flagged as sensitive.
Data Assurance
Data are not facts. We cannot automatically believe any data that we receive. Some data may be user-supplied without any assurance. Other data may be checked and validated. And then there may be data that we can rely on, e.g. because we are the reliable source of the data.
The difference in data reliability is indicated by level of assurance (LOA). Data with low LOA are considered to unreliable. Data with high LOA are often considered to be facts.
However, values with different levels of assurance may be combined in a single property. For example, the user may specify his name as "Bill R. Smith" (low LOA). But we may get a different value "William Random Smith" from government registry or certificate (high LOA). Both of those values can be in fact true, but they are used for different purposes. We may need to keep all those values, but we will need to correctly maintain meta-data about origin, LOA and purposes.
Data Retention
Usually, we want to retain data, even when they are not used anymore. It might be because of auditing, traceability, statistics, or only to make the process of re-activating users more comfortable. On the other hand, we usually don’t need data older then certain threshold or we are even legally obliged to delete such data. Policies for data retention might vary depending on the source of the data and also other factors. Therefore we will need the meta-data to correctly evaluate such policies and even for providing reports about compliance.
Diagnostics
Provenance meta-data may be used to make ordinary system diagnostics much easier. There is perhaps no need to explain how can creation/modification timestamps help with diagnosing mapping problems. The indication of system of origin may be used to check whether the mappings work correctly.
We may even record the identifiers of mappings that were used to transform the value. This may provide significant advantage in "explaining" how the value was processed by the system.
Maintaining progressive user profile
User data doesn’t have to come from a single source. Especially, when we are leveraging the existing external identities of users like social identities (e.g. Google, Facebook), state-issued identities (e.g. eID) or identity federations (e.g. eduGAIN). But the same situation might happen even within a single organization (e.g. separate HR database and contact database).
The challenge is that we might get different values for the same attribute, which is in nature single-valued, like name or primary email address. To solve it, we need a set of rules which will decide which source will be used. The rules might even rely on other data provenance features like levels of assurance. Moreover, the system has to react on changes in the inbound data, including adding a new source or removing an existing one for a given user.
This also enables us to put users in control and let them choose which source will be the primary one for their main data profile. Which again leads us back to privacy protection and data control. Of course, it opens a new set of interesting problems, because some services might require not the data which a user prefers, but data with high LoA or data coming from a particular source (e.g. organization HR).
MidPrivacy survey
During the gathering information about use cases, we discovered that there is not a single one which stands out among others. That is why we have decided to involve midPoint community by asking them to fill in the survey, which will help is with an assessment of use-cases.
Unfortunately, we got only a few responses. Therefore the result cannot be used as a final image of the real word requirements but rather as an indication what midPoint community needs.
Knowing the origin of the data is the top feature that people need. The second place in priorities list is to use the metadata for debugging purposes, for example, knowing which mapping produced the value. Knowing the provisioning targets for each value is the last feature which was unambiguously marked as required.
Using data provenance for assurance or building a progressive profile of users fits in the same category. Some people would like it; others don’t need it at all. This result was expected given the nature of the features, which are mode focused on environments with multiple data sources with a different assurance of the data. Example of such an environment is higher education.
The last category of questions has the answers distributed between all options. That means some people need the features mentioned in the questions, some might use it in the future, and others don’t want to use it or have other solution deployed or planned. Features in this category contain a legal reason for data processing, consent management, policy-related metadata, sensitive data management and privacy protection of data in midPoint’s own interfaces. We are interpreting this result as unclear the moment, and we will investigate it again in the future. We need to understand if features in this category are that specific that people don’t need them or if the root cause is the knowledge how to use such features, which would lead us to write guidelines etc. Another option is that there is another solution for the features which might be technical or only administrative. In that case, we need to understand if and how can midPoint bring some additional value.