MidPrivacy: Data Provenance Prototype
Data provenance is one of the fundamental problems of data protection. Data protection regulations and practices ask for transparency and accountability. However, currently systems are seldom capable of tracing origin (provenance) of data that they are processing. This situation is hardly surprising given the complexity that data provenance brings, especially for data modelling and maintenance.
Data provenance was chosen as the primary goal of the first phase of MidPrivacy initiative because it brings a solid foundation to build full suite of privacy-enhancing features in the future.
Data Provenance
Data provenance is one of major problems of data protection and identity management in general. The problem may be summarized by a simple question "Where did the data come from?". This is a simple question, but the answer is surprisingly complex. Were the data provided by the user? When did the user provide that data? For what purpose? Were the data retrieved from the HR system? Are they bound to a specific contract? Were the data received from a third party? Were they created as a part of "social login" or a membership in identity federation? Was it a combination of several systems? Do we have conflicting data coming from different sources?
The question of data provenance is a critical problem for transparency and accountability. Data controllers and processors often lose track of data origin. But how can one implement proper data protection if data origin is not known? It would be almost impossible to demonstrate how a particular data item was obtained, that it was handled properly and that there is existing legal basis for processing of that particular item. Smaller and simpler organization can probably track this using a "paper processes". However, this approach is not feasible for larger and complex organizations. Data protection policies and processes must be automated. The problem is, that this requires to create a metadata model – we need to maintain a complex metadata about each data item of the (primary) data model. E.g. for each value of user’s e-mail address we need to know origin, creation dates, modification dates, expiration dates, processing purposes or reference to specification a lawful basis and possibly a lot of other information. An additional problem is that this meta-data set is likely to evolve in time as it needs to adapt to privacy policies and evolution of data protection regulations.
Phase Goals and Outcomes
Outcome of Data Provenance phase of midPrivacy project is prototype implementation of privacy-enhancing features in midPoint. This phase of the project was aimed at prototyping and evaluating data provenance capabilities. Existing midPoint data model functionality was enhanced with the metadata schema capabilities to track origin (provenance) of every data item. Such changes required development of new data modelling language: Axiom. The project included development of prototype user interface aimed at intuitive presentation of data provenance to a user. Overall goal of the enhancement is it improve transparency and accountability of personal data processing.
Project outcomes:
-
Design of Axiom, a new data modelling (schema) language to support metadata schema capabilities required to support data provenance features.
-
Implementation of prototype libraries for processing the metadata schema to evaluate feasibility of this approach.
-
Use metadata schema to process provenance-annotated personal data. Implementation of prototype functionality in midPoint.
-
Implementation of prototype user interface to present the provenance data to the user. The purpose of this prototype is to evaluate whether the complex meta-data can be presented to a user in an understandable and intuitive way, thus supporting transparency and user intervenability in personal data protection. This prototype user interface can be used in future usability testing with a potential to be fully productized.
-
Evaluation of market potential for data protection features in IDM systems in two different ways:
-
Quick study of market demand for data protection features (by using surveys, on-line and personal discussions and similar means).
-
Use of prototypes created in this project as a basis for further discussions with potential customers and users, evaluating potential for full productization of data protection mechanisms in midPoint.
-
Functionality developed in the scope of this project is part of midPoint 4.2 release. As this is a prototyping project, the functionality is not considered to be production-ready and it is formally considered to be experimental.
Please see outcomes document for full details about project outcomes.
Documents
-
Axiom Data Modeling Language
-
Metadata Processing
Blog, Articles And Other Media
-
Evolveum Blog
-
Data Provenance Workshop - workshop invitation blog
-
Data Provenance Workshop Results - workshop wrap-up blog
-
Data Provenance Provenance Finished - project wrap-up blog
-
Project Management Documents
-
Data Provenance and Metadata Management in IdM on-line workshop (September 2020)
Timeline
Milestone | Goal | Planned date | Status |
---|---|---|---|
START |
Project start |
15-March-2020 |
DONE |
M1 |
Meta-schema prototype |
15-May-2020 |
DONE |
M2 |
Meta-schema integrated into midPoint core |
15-July-2020 |
DONE |
FINISH |
Project finish |
15-September-2020 |
DONE |
Funding
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the NGI_TRUST grant agreement no 825618.