Enriching OSDU Objects to simplify search
Enriching the OSDU R1 Metadata structure and data to simplify search for R2
- Under review
Context & Scope
R1 relied heavily on denormalization of the index to improve the usability of search. Some of this denormalization (flattening hierarchical attributes) creates integrity risk in the index, while other changes simply removed ambiguity. Even in the latter cases, however, the index no longer truly mirrored the metadata structure defined by the data definitions group.
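As an illustration only, and not a description of the actual R1 index, the following sketch shows one way that flattening hierarchical attributes can weaken integrity: once nested objects are collapsed into per-field value lists, the pairing between sibling values is lost and a search can match combinations that never existed in the source. The record layout and the field names (`FacilityName`, `Events`, `Type`, `Date`) are hypothetical.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, list[Any]]:
    """Collapse nested dicts and lists of dicts into dotted keys.

    Every dotted key maps to the list of leaf values found under it.
    """
    flat: dict[str, list[Any]] = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            children = [value]
        elif isinstance(value, list) and all(isinstance(v, dict) for v in value):
            children = value
        else:
            # Scalars (and lists of scalars) are kept as a single leaf value.
            flat.setdefault(path, []).append(value)
            continue
        for child in children:
            for sub_path, leaves in flatten(child, f"{path}.").items():
                flat.setdefault(sub_path, []).extend(leaves)
    return flat


def matches(flat: dict[str, list[Any]], criteria: dict[str, Any]) -> bool:
    """Return True if every criterion value appears under its dotted key."""
    return all(value in flat.get(key, []) for key, value in criteria.items())


# Hypothetical record: two events, each with its own type and date.
record = {
    "FacilityName": "Well-1",
    "Events": [
        {"Type": "Spud", "Date": "2019-01-01"},
        {"Type": "Abandon", "Date": "2020-01-01"},
    ],
}

flat = flatten(record)

# No single event is a "Spud" on 2020-01-01, yet the flattened form matches,
# because the link between each Type and its Date is gone.
false_positive = matches(flat, {"Events.Type": "Spud", "Events.Date": "2020-01-01"})
```

Here `false_positive` is `True` even though the source record contains no such event, which is the kind of divergence between index and metadata structure that this decision aims to avoid.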
We are looking for an approach that achieves the usability goal while ensuring that the index and metadata structure remain aligned.
*The schema version has been updated; @dmitry-kniazev will update the version based on the scripts.
For R2 we will take advantage of schema and metadata versioning as well as enrichment.
- We will load the original OSDU schema (version 0.2.0) and metadata with minimal manipulation during ingestion.
- We will enrich the schema (version 0.2.1) and metadata by adding computed properties to improve the usability of search.
- We will evaluate the resulting search semantics for both approaches.
- Finally, we will revisit this decision with the EA Architecture and data definitions subcommittees once R2 is completed to evaluate the trade-offs of this implementation as input to how we approach this in the long term.
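The enrichment step described in the list above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the R2 implementation: the record layout (`id`, `schemaVersion`, `data`), the `derivedFrom` back-reference, and the computed property `PreferredName` are all hypothetical. Only the version numbers 0.2.0 and 0.2.1 come from this document.

```python
import copy
from typing import Any, Callable, Optional

ORIGINAL_VERSION = "0.2.0"  # original OSDU schema version (from this document)
ENRICHED_VERSION = "0.2.1"  # enriched schema version (from this document)


def preferred_name(data: dict[str, Any]) -> Optional[str]:
    """Hypothetical computed property: pick the alias marked as preferred."""
    for alias in data.get("NameAliases", []):
        if alias.get("Type") == "Preferred":
            return alias.get("Name")
    return None


# Hypothetical registry of computed properties added by enrichment.
COMPUTED_PROPERTIES: dict[str, Callable[[dict[str, Any]], Any]] = {
    "PreferredName": preferred_name,
}


def enrich(record: dict[str, Any]) -> dict[str, Any]:
    """Return an enriched copy of a 0.2.0 record; the input is left unchanged."""
    if record["schemaVersion"] != ORIGINAL_VERSION:
        raise ValueError(f"expected schema version {ORIGINAL_VERSION}")
    enriched = copy.deepcopy(record)
    for name, compute in COMPUTED_PROPERTIES.items():
        enriched["data"][name] = compute(record["data"])
    enriched["schemaVersion"] = ENRICHED_VERSION
    # Keep a back-reference so the enriched record can be traced to its source.
    enriched["derivedFrom"] = {"id": record["id"], "schemaVersion": ORIGINAL_VERSION}
    return enriched


original = {
    "id": "well-1",
    "schemaVersion": "0.2.0",
    "data": {
        "NameAliases": [
            {"Name": "W-001", "Type": "Regulatory"},
            {"Name": "Well-1", "Type": "Preferred"},
        ]
    },
}
enriched = enrich(original)
```

Because `enrich` works on a deep copy, the original record is unchanged and the computed properties exist only in the 0.2.1 version, which makes it possible to compare search behaviour against both versions.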
graph LR
  style Storage fill:#0F0,stroke:#333,stroke-width:4px
  subgraph Ingestion
    S1[/OSDU Schema 0.2.1/] --Store--> Storage
    S0 --Prepare--> S1
    S0[/OSDU Schema 0.2.0/] --Prepare & Store--> Storage[(Storage)]
    D0[/OSDU Metadata 0.2.0/] --Prepare & Store--> Storage
  end
  subgraph Enrichment
    Storage --Extract--- Enrich[Enrich to 0.2.1]
    Enrich --Produce--> D1[/OSDU Metadata 0.2.1/]
    D1 --Store--> Storage
  end
Versioning and enrichment are core architectural capabilities. By performing this enrichment inside the data platform, we maintain integrity and traceability.
For every OSDU record coming in, we will have two versions (original and enriched). This is normal practice.
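To make the "two versions per record" idea concrete, here is a small in-memory sketch of a store keyed by record id and schema version. It is an assumption-laden illustration: `VersionedStore` and its methods are hypothetical names and do not correspond to the OSDU storage service API.

```python
from typing import Any


class VersionedStore:
    """Hypothetical in-memory store that keeps every version of a record."""

    def __init__(self) -> None:
        self._records: dict[tuple[str, str], dict[str, Any]] = {}

    def put(self, record: dict[str, Any]) -> None:
        key = (record["id"], record["schemaVersion"])
        if key in self._records:
            # Versions are immutable: a stored version is never overwritten.
            raise KeyError(f"version already stored: {key}")
        self._records[key] = record

    def get(self, record_id: str, schema_version: str) -> dict[str, Any]:
        return self._records[(record_id, schema_version)]

    def versions(self, record_id: str) -> list[str]:
        """Schema versions held for a record, in the order they were stored."""
        return [version for (rid, version) in self._records if rid == record_id]


store = VersionedStore()
store.put({"id": "well-1", "schemaVersion": "0.2.0", "data": {"Name": "Well-1"}})
store.put(
    {
        "id": "well-1",
        "schemaVersion": "0.2.1",
        "data": {"Name": "Well-1", "PreferredName": "Well-1"},
    }
)
```

With both versions retained side by side, the original record stays available for audit while search can be pointed at the enriched one.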
When to revisit
April 1st, 2020
Tradeoff Analysis - Input to decision
Alternatives and implications
- Use the original schema and metadata without enrichment: Usability issues remain, and working around them relies heavily on training and on establishing documented conventions for interpretation.
- Perform all the enrichment outside of the data platform: All traceability to the original data and transforms is lost.
Decision criteria and trade-offs
Initial decision: Feb 14, 2020. Revisit decision: April 1, 2020.