ADR: Validation Service
Context and Scope
The problem and the solution raised here relate to the already submitted issue:
In addition to the scenarios described in the original issue, we see some scenarios where schema / data validation should be enforced. Some of these scenarios require a more sophisticated validation approach than the one described in the original issue:
A. Schema Structure Validation scenarios:
Manifest Schema structure validation during Ingestion. The Manifest file structure should be aligned with the corresponding Schema structure; otherwise ingestion fails. It would improve the user experience if it were possible to get a report with the discrepancies between the Manifest file and the corresponding Schema structure. The user would then know which parts of the Manifest must be corrected to run the Ingestion flow.
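A minimal sketch of such a discrepancy report, in plain Python for illustration (a real implementation would more likely validate against the actual JSON Schema, e.g. with a standard JSON Schema validator). The expected attributes and types below are invented, not the real WPC schema:

```python
# Minimal sketch of a structure check that collects EVERY discrepancy
# between a Manifest and its Schema instead of failing on the first one.
# The expected-attribute table is an assumption for illustration only.
EXPECTED = {            # attribute -> expected Python type (assumed schema)
    "ResourceID": str,
    "ResourceTypeID": str,
    "Data": dict,
}

def discrepancy_report(manifest: dict) -> list[str]:
    """Return one message per mismatch so the user can fix the Manifest."""
    problems = []
    for attr, expected_type in EXPECTED.items():
        if attr not in manifest:
            problems.append(f"missing attribute: {attr}")
        elif not isinstance(manifest[attr], expected_type):
            problems.append(
                f"{attr}: expected {expected_type.__name__}, "
                f"got {type(manifest[attr]).__name__}"
            )
    for attr in manifest.keys() - EXPECTED.keys():
        problems.append(f"unknown attribute: {attr}")
    return problems

report = discrepancy_report({"ResourceID": "srn:file:123", "Data": []})
# Two findings: ResourceTypeID is missing, and Data has the wrong type.
```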
Continuous Schema structure change. It is likely that a specific WPC Schema structure will change over time. The complexity of the change can vary: one attribute or many can be added or deleted. An operator will need to enrich older WPCs (created with the previous version) and create new versions of these WPCs. To simplify selecting the WPCs that need this, the validation service can be run to compare older WPCs with the new schema structure and produce a report listing all found discrepancies.
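The comparison could be as simple as diffing attribute sets, sketched below. The record shape and field names are hypothetical:

```python
# Hedged sketch: compare the attribute set of an older WPC record against
# a new schema version and report what needs enrichment. Names are invented.
def schema_diff(old_record: dict, new_schema_attrs: set[str]) -> dict:
    old_attrs = set(old_record)
    return {
        "missing_in_record": sorted(new_schema_attrs - old_attrs),    # needs enrichment
        "no_longer_in_schema": sorted(old_attrs - new_schema_attrs),  # deprecated
    }

old_wpc = {"ResourceID": "srn:wpc:1", "WellName": "A-1"}
diff = schema_diff(old_wpc, {"ResourceID", "WellName", "SpudDate"})
# Reports that SpudDate must be added when creating the new WPC version.
```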
B. Data Validation scenarios:
Reference SRN checks. WPC and Master Data Manifests contain SRN references to Reference Data values (including “fixed” Reference Data schema values). That means the corresponding Reference Data must be ingested into OSDU before Master Data or WPC data. If a user tries to ingest WPC / Master Data when a referenced value does not exist, ingestion should be terminated, and the user should know what caused the termination. As a solution, a validation check should be implemented at the ingestion step: it verifies that all Reference Data values linked to SRNs in the Manifest are present in OSDU. The user should be able to get a report stating which validation checks failed; the user can then ingest the corresponding Reference Data and proceed with the WPC or Master Data ingestion.
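A sketch of that pre-ingestion check follows. The SRN pattern and the in-memory lookup set are assumptions; a real check would query the Storage or Search service for each reference:

```python
# Sketch of the check that every Reference Data SRN in a Manifest already
# exists in OSDU. The SRN format and the "ingested" set are illustrative.
import re

SRN_PATTERN = re.compile(r"srn:reference-data/[\w\-/:.]+")

def collect_srns(node) -> set[str]:
    """Recursively gather reference-data SRN strings from a Manifest."""
    found = set()
    if isinstance(node, dict):
        for value in node.values():
            found |= collect_srns(value)
    elif isinstance(node, list):
        for item in node:
            found |= collect_srns(item)
    elif isinstance(node, str):
        found |= set(SRN_PATTERN.findall(node))
    return found

def missing_references(manifest: dict, ingested: set[str]) -> set[str]:
    """SRNs referenced by the Manifest but absent from OSDU."""
    return collect_srns(manifest) - ingested

manifest = {"Data": {"UnitSRN": "srn:reference-data/UnitOfMeasure:m:"}}
missing = missing_references(manifest, ingested=set())
# Non-empty result -> terminate ingestion and report the missing SRNs.
```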
Master Data SRN checks. Similar to the scenario described above, Master Data should ideally be ingested before WPC. However, if Master Data for a WPC is unavailable, the ingestion workflow should be configurable: a. either the WPC ingestion is rejected, b. or the workflow allows creation of an “orphan” WPC (the linked Master Data doesn’t exist in OSDU, but the WPC is created) and somehow “tags” the properties that miss real SRN values. Enrichment of the “orphan” WPCs is done later, after the corresponding Master Data is ingested.
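The configurable decision above can be sketched as follows. The policy names, the `OrphanedReferences` tag attribute, and the record shape are all invented for illustration:

```python
# Sketch of the configurable workflow: reject the WPC, or create an
# "orphan" WPC tagged for later enrichment. All names are assumptions.
REJECT, ALLOW_ORPHAN = "reject", "allow_orphan"

def ingest_wpc(wpc: dict, missing_master_srns: set[str], policy: str) -> dict:
    if missing_master_srns and policy == REJECT:
        raise ValueError(f"missing Master Data: {sorted(missing_master_srns)}")
    if missing_master_srns:
        # Tag the orphan so it can be found and enriched once the
        # corresponding Master Data is ingested.
        wpc = {**wpc, "OrphanedReferences": sorted(missing_master_srns)}
    return wpc

orphan = ingest_wpc({"ResourceID": "srn:wpc:1"},
                    {"srn:master-data/Well:42:"}, policy=ALLOW_ORPHAN)
```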
Multiple data quality scenarios. There is a need for a mechanism to run data quality checks on the Manifest file content (e.g. validation that x, y coordinates in the resource correspond to the resource's geo entity value, not-null property value validation, etc.). These checks can be applied during Ingestion and post-Ingestion.
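One of the simplest rules of this kind, a not-null check over selected Manifest properties, might look like this sketch (the dotted property paths are invented):

```python
# Sketch of a data-quality rule: report Manifest properties that are
# missing or null. The path syntax and property names are illustrative.
def not_null_check(manifest: dict, required_paths: list[str]) -> list[str]:
    """Return the dotted paths whose values are missing or null."""
    failures = []
    for path in required_paths:
        node = manifest
        for key in path.split("."):
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if node is None:
            failures.append(path)
    return failures

failed = not_null_check({"Data": {"WellName": None}},
                        ["Data.WellName", "Data.SpudDate"])
```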
Suggested Implementation Approach
The suggested approach is to develop a Validation service that provides an API contract to validate a virtual object (a JSON Manifest). This allows a user to run validation rules both over a stored resource record and over a manifest in progress, which gives the flexibility to run validation at different stages of the data lifecycle. The Validation API will allow a user to:
- Register and store a validation rule
Rules should be configured as pluggable code. It is up to the individual Operator to create the code for pluggable rules, though we can consider supplying several rules out of the box. For example, rules #1 (moved) and #2 (moved) described above, related to schema structure validation, can be created by the OSDU team. Rules #3 (moved) and #4 (moved), related to Reference and Master Data validation, can also be developed by the OSDU team, but the workflow configuration based on the results of these rules should be up to an Operator, as should the configuration of data quality rules.
- Send a validation request
Validation Service calls can be “plugged in” to different OSDU services: Ingestion, Enrichment, pre-Ingestion, etc.
- Produce a response with validation rule results
The response can be generated in different formats (to be negotiated with the Data Definitions team):
- Updated original Manifest object: “Extended properties” block in the schema
- Updated original Manifest object: an additional attribute can be developed to store the property validation result
- A separate JSON file can be generated

Depending on the validation result, additional steps in the data workflow can be taken (e.g. ignore the validation results and just store them, or trigger another DAG).
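The register / request / respond cycle above can be sketched end to end. The in-process registry, rule names, and the separate-report response shape are illustrative only; the real contract would be an HTTP API:

```python
# End-to-end sketch of the proposed Validation API: register a pluggable
# rule, send a validation request for a Manifest, and get a per-rule
# result report back. All names here are assumptions for illustration.
from typing import Callable

RULES: dict[str, Callable[[dict], list[str]]] = {}

def rule(name: str):
    """Register and store a validation rule as pluggable code."""
    def wrap(fn):
        RULES[name] = fn
        return fn
    return wrap

@rule("required-resource-id")
def check_resource_id(manifest: dict) -> list[str]:
    return [] if manifest.get("ResourceID") else ["ResourceID is missing"]

def validate(manifest: dict) -> dict:
    """Produce a response with validation rule results.

    Shaped like the "separate JSON file" option: one entry per rule,
    listing that rule's failures (empty list = rule passed)."""
    return {name: fn(manifest) for name, fn in RULES.items()}

response = validate({"Data": {}})
```

A caller (Ingestion, Enrichment, or a pre-Ingestion step) would inspect the response and decide whether to store it, fail the workflow, or trigger another DAG.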
Pros of implementing Validation functionality as a service:
- Can work over physical resource record and over manifest that hasn’t been ingested yet
- Validation requests can be sent by Java and Python applications
- Validation checks can be configurable
- Flexibility in when a validation check is applied (pre-Ingestion, Ingestion, Enrichment, etc.)
Cons of implementing Validation functionality as a service:
- Additional service development and maintenance.