MidPrivacy: Data Provenance Prototype

Data provenance is one of the fundamental problems of data protection. Data protection regulations and practices demand transparency and accountability. However, current systems are seldom capable of tracing the origin (provenance) of the data they process. This is hardly surprising given the complexity that data provenance brings, especially for data modelling and maintenance.

Data provenance was chosen as the primary goal of the first phase of the midPrivacy initiative because it provides a solid foundation on which to build a full suite of privacy-enhancing features in the future.

Data Provenance

Data provenance is one of the major problems of data protection and of identity management in general. The problem can be summarized by a simple question: "Where did the data come from?" The question is simple, but the answer is surprisingly complex. Were the data provided by the user? When did the user provide them? For what purpose? Were the data retrieved from the HR system? Are they bound to a specific contract? Were the data received from a third party? Were they created as part of a "social login" or membership in an identity federation? Did they come from a combination of several systems? Do we have conflicting data coming from different sources?

The question of data provenance is critical for transparency and accountability. Data controllers and processors often lose track of data origin. But how can one implement proper data protection if the origin of the data is not known? It would be almost impossible to demonstrate how a particular data item was obtained, that it was handled properly, and that there is a legal basis for processing that particular item. Smaller and simpler organizations can probably track this using "paper processes". However, this approach is not feasible for larger and more complex organizations, where data protection policies and processes must be automated. The problem is that automation requires a metadata model: we need to maintain complex metadata about each data item of the (primary) data model. For example, for each value of a user’s e-mail address we need to know its origin, creation, modification and expiration dates, processing purposes or a reference to the specification of a lawful basis, and possibly a lot of other information. An additional problem is that this metadata set is likely to evolve over time, as it needs to adapt to privacy policies and to the evolution of data protection regulations.
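To make the idea of per-value metadata more concrete, the following minimal Python sketch attaches a metadata record to a single e-mail address value. It is only an illustration: the class names, field names and origin labels (`ValueMetadata`, `AnnotatedValue`, `"hr-system"`, `"account-provisioning"`) are assumptions made for this example and do not describe the actual midPoint data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ValueMetadata:
    """Hypothetical metadata kept for one value of one data item."""
    origin: str                              # where the value came from (illustrative label)
    created: datetime                        # when the value was first recorded
    modified: Optional[datetime] = None      # last modification, if any
    expires: Optional[datetime] = None       # when the value may no longer be used
    purposes: Tuple[str, ...] = ()           # processing purposes (illustrative labels)
    lawful_basis_ref: Optional[str] = None   # reference to a lawful basis specification

    def is_expired(self, now: datetime) -> bool:
        """Return True if an expiration date is set and has been reached."""
        return self.expires is not None and now >= self.expires


@dataclass(frozen=True)
class AnnotatedValue:
    """A primary data value together with its metadata."""
    value: str
    metadata: ValueMetadata


# One value of a user's e-mail address, annotated with assumed example metadata.
email = AnnotatedValue(
    value="alice@example.com",
    metadata=ValueMetadata(
        origin="hr-system",
        created=datetime(2020, 1, 15, tzinfo=timezone.utc),
        expires=datetime(2025, 1, 15, tzinfo=timezone.utc),
        purposes=("account-provisioning",),
        lawful_basis_ref="contract",
    ),
)
```

Keeping the metadata in a separate record, rather than as extra attributes on the user object, means that two values of the same item can carry different origins and expiration dates.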

Phase Goals

The expected outcome of the Data Provenance phase of the midPrivacy project is a prototype implementation of privacy-enhancing features in midPoint. This phase is aimed at prototyping and evaluating data provenance capabilities. The existing midPoint data model functionality will be enhanced with meta-schema capabilities to track the origin (provenance) of every data item. This will likely require the development of a new data modelling language. The project also includes the development of a prototype user interface aimed at presenting data provenance to users intuitively. The overall goal of the enhancement is to improve the transparency and accountability of personal data processing.
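One way to picture a meta-schema is as a declaration of which metadata every value of a given data item must carry, which a library can then check. The sketch below is a hypothetical Python illustration under that assumption. The item names, metadata field names and the `missing_metadata` helper are invented for this example and are not the meta-schema language or API developed in the project.

```python
from typing import Dict, FrozenSet, List, Mapping

# Hypothetical meta-schema: data item name -> metadata fields required for each value.
META_SCHEMA: Dict[str, FrozenSet[str]] = {
    "emailAddress": frozenset({"origin", "created", "lawfulBasis"}),
    "fullName": frozenset({"origin", "created"}),
}


def missing_metadata(item: str, metadata: Mapping[str, object]) -> List[str]:
    """Return the sorted names of required metadata fields that are absent or None.

    Items not declared in META_SCHEMA have no requirements in this sketch.
    """
    required = META_SCHEMA.get(item, frozenset())
    return sorted(name for name in required if metadata.get(name) is None)


complete = {"origin": "hr-system", "created": "2020-01-15", "lawfulBasis": "contract"}
partial = {"origin": "user-entry"}

print(missing_metadata("emailAddress", complete))  # []
print(missing_metadata("emailAddress", partial))   # ['created', 'lawfulBasis']
```

Because the requirements live in data rather than in code, they can be changed as privacy policies evolve without touching the primary data model.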

Expected project outcomes:

  • Adapt an existing data modelling (schema) language, or design a new one, to support the meta-schema capabilities required by data provenance features.

  • Implement prototype libraries for processing the meta-schema to evaluate the feasibility of this approach.

  • Use the meta-schema to process provenance-annotated personal data. Implement prototype functionality in midPoint.

  • Implement a prototype user interface to present the provenance data to the user. The purpose of this prototype is to evaluate whether the complex metadata can be presented to a user in an understandable and intuitive way, thus supporting transparency and user intervenability in personal data protection. This prototype user interface can be used in future usability testing, with the potential to be fully productized.

  • Evaluate the market potential for data protection features in IDM systems in two different ways:

    • Conduct a quick study of market demand for data protection features (e.g. by using surveys, on-line and personal discussions and similar means).

    • Use the prototypes created in this project as a basis for further discussions with potential customers and users, evaluating the potential for full productization of data protection mechanisms in midPoint.

Milestone goal                              Planned date   Status
Project start
Meta-schema prototype
Meta-schema integrated into midPoint core
Project finish                                             In progress


This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the NGI_TRUST grant agreement no 825618.