Following open questions, issues and ideas remain unresolved (or partially resolved) after the end of this project phase. This is a material for potential future work.
Metadata And Provenance Research
It looks like metadata are inherently multi-valued. The same value may come from several sources. Metadata needs to be maintained for each origin of the value separately. In hindsight, this seems to be a very natural property of the metadata. Also, work on our prototype confirms that this multi-valued notion of metadata is necessary. However, we have not noticed any mentions about this inherent metadata multiplicity when we have done our initial research. Similarly, we have not been able to locate any substantial information on data provenance schemas or practical working systems utilizing data provenance mechanisms. This lack of information about metadata (and provenance metadata in particular) seems to be quite surprising. It may suggest that this field is much less researched that we have originally assumed. Therefore, the aspects of metadata design perhaps deserve more attention from the scientific point of view. It would be worth to revisit the theoretical aspects of data provenance once again and perhaps record our findings in a form of research paper. Unfortunately, this was not meant to be a research project and there is insufficient funding to follow this path of inquiry at the moment.
Metadata Schema Evolution
Metadata multiplicity and the concept of yields is opening up entirely new view on metadata schemas. For example the concept of storage metadata may not be a good representation of the data processing logic any more. The metadata are now separated into yields, even though the data are physically stored in the database at a single moment.
While some metadata structures (e.g.
transformation) seems to fit well, other (e.g.
storage) may need later revisions.
Presentation of metadata is hard. We have expected that, but we have uncovered the depths of the problem that we haven’t expected. Metadata multiplicity is perhaps the most surprising issue and it deeply affects the overall user experience of metadata presentation. We have found a way how to present metadata in midPoint. Despite the fact that our solution went through several UX improvement iterations, we are not entirely happy about the (lack of) ease of use and intuitive understanding of metadata. User experience of metadata presentation seems to be a topic that would deserve further exploration.
However, there is yet another user experience concern that we have discovered, which is related to data portability. It is difficult to track the user entry origin of data. For example, we may already know user’s full name from HR system or some other automated source. We can present the full name to the user to review the data, e.g. as part of ordinary user profile form. However, if the full name is correct, the user will not change it and therefore there will be no user entry origin recorded for that particular value. This may work well in simple use cases. However, there are cases where it does not work well, e.g. account splitting or removal of data source. Further exploration of the use cases and especially their user experience is needed to address this issue.
There seems to be significant overlap of data provenance and data protection. We have suspected that data provenance is one of the basic blocks of data protection mechanisms, but the extent of "entanglement" of these two fields still surprised us. It looks like data provenance is deeply related to bases for data processing. However, we have done only a very shallow analysis of this relationship. Further research into data protection mechanism would be desirable to uncover the details.
Data provenance has an obvious and significant overlap to data portability. However, we have not dealt with data portability specifically. Our exploration of data provenance was limited to the "scenery" as it was seen by midPoint. We have considered the systems that are directly connected to midPoint, but we haven’t explored any further.
We have prepared a proposal to NGI Data Portability and Services Incubator (DAPSI), with an intent to follow-up on our work with metadata. Unfortunately, our proposal was not selected. We hope that we will have better success in securing funding in the future, as our work suggests that there may be interesting opportunities for exploration in data portability area.
Metadata Queries And Indexing
Computing metadata is one thing, but storing, indexing and searching the metadata is another thing entirely. This phase of the project is only partially concerned with storing metadata and it is not concerned with indexing and searching metadata at all. Indexing and searching metadata brings its own unique set of questions, such as forming queries that refer to metadata. Axiom item paths can refer inframodel items, therefore metadata can theoretically be referenced in search queries. However, this aspect was not explored as part of this prototype. Yet another problem is related to metadata schema. It looks like the metadata are fundamentally multi-value. However, common database indexing mechanisms are problematic when applied to multi-value data. This problem has to be explored in depth. In addition to that, there may be a concern on metadata storage and indexing overhead. As metadata can be maintained for every value, the metadata-enriched model may be very large. Storage and indexing of such a large model may be too expensive. Therefore it is likely that selective storage/indexing for metadata will be needed.
Metadata for non-complex values can be already quite complicated. However, there is an additional degree of complexity when considering meta-data of complex (structured) values. In Axiom/Prism terminology, those are object and container metadata. Complex values contain another values within them. For example modification timestamp of an object should be set to the time of the latest modification of any item that the object contains. This phase of the project did not explore this issues in depth, as there is already a pre-existing mechanism for object metadata maintenance in midPoint. However, complete metadata solution should focus on this area as well.
MidPoint Metadata Compatibility
MidPoint has a pre-existing simple metadata maitenance mechanism that is applied to several selected items. Due to the compatibility reasons, this mechanism had to be maintained and could not be completely replaced with a new mechanism. This phase of the project has discovered several interesting incompatibilities. The multi-value nature of metadata was one of the most interesting discoveries. It is also perhaps the most troubling one, as many pre-existing systems assume single-value character of metadata. The single-value character of metadata is assumed by midPoint 4.1 and earlier. We had to keep compatibility with existing midPoint installations, therefore the pre-existing system cannot be dismantled and relaced. As of midPoint 4.2, the two systems co-exist side by side. Further research is needed to align those two systems completely. The questions that are outlined above are essential for this alignment to be possible.
Metadata authorizations are currently implemented in a very simple way: any person authorized to access the data is also authorized to access all associated metadata. This approach is good enough for the prototype, however, it is obviously too simple for secure real-world use. We have designed authorization approach that could fit the needs for metadata authorizations, but it is not simple. As metadata add entirely new dimension to data, authorization is also affected by this complexity. We hope that the designed authorization approach could work, but it was not implemented and tried in practice.
Axiom has to be further evolved to become (more) mature language. Using Axiom to define metadata schemas is a great validation of the concepts, especially the inframodel concept. It is also a validation that basic mechanisms of Axiom language work. However, there is still a lot to work on for Axiom to become a mature language:
Support for enumerations.
Support for ordered data types (lists/arrays).
Support for heterogeneous multi-valued data structure that may contains mixed elements (a.k.a. "heterolists").
Make sure that Axiom is "web-friendly" and that it is suitable for building REST-like interfaces.
Smooth out JSON/YAML syntax. There are still several alternative notations, e.g.
@context. May be worth considering at least partial interoperability with JSON-LD or other JSON variants.
Final decision on small Axiom language details. E.g.
multivalue. We are also considering making semicolon optional, as well as quotes around single-word strings.
Introduction of dynamic data type: data type that can be fully specified at runtime.
Versioning and deprecation details. Some versioning concepts are already specified, but the compatibility and deprecation rules for Axiom model versioning are not yet entirely clear.
Support for custom annotations in Axiom models.
Deltas: are they native (hardcoded) data structures in Axiom? Are deltas modeled in Axiom as any other data structure? Or do deltas rather belong to Prism?
Decide about query language. Should it be part of Axiom? Or should it rather belong to Prism?
Display properties. Should Axiom specify how a particular item should be displayed (e.g. display name, internationalization, etc.). Or does this rather belong to Prism?
Major part of future Axiom work would be documentation. There is Axiom specification, but it is mostly supposed to be reference document. More documents, such as tutorials, guides and examples will be needed for easier understanding of Axiom concepts.
Prism Cleanup And Evolution
Prism data library is built on top of Axiom language. Prism was used in midPoint for many years, but it was built on XSD instead of Axiom. The XSD was a major limitation. Switch to Axiom opened up exciting opportunities and enabled further development of Prism. Such as:
Solving the container identifier problem. Multi-value containers need identifiers to properly address each container value (e.g. in deltas). We have tried to create automatic numeric container identifiers in Prism 3.x with XSD. Container identifier was transparently assigned by the system, which was supposed to ensure its uniqueness. However, that did not work well and we were not able to find a satisfactory solution. As Prism is built with eventual consistency in mind, there were always risks of identifier conflicts. The solution would be to have application-assigned and perhaps application-meaningful identifiers. However, that would be quite cumbersome and difficult to use with XSD. Axiom opens up a possibility for better solution.
Operational items are set by the system. They are considered to be read-only for the user. Operational items often need special handling, e.g. a code that computes them. Authorizations may ignore operational items in some cases. Operational items may be stored differently, they may affect data versioning and so on. We need a clean way how to mark operational items in the model.
Elaborate items are often too complex for automatic processing. Ordinary items can be almost always processed by automatic code to render the user interface, to evaluate authorizations and so on. No special code is usually needed. But some items are too complex for that, e.g. items that contain recursive data structures. Usual automatic algorithms fail for such items. Therefore such items are marked as "elaborate" to avoid automatic processing.