This page captures the Scope, Definition of Done, Horizons, and Milestones for the R3 Manifest Ingestion workflow. Note that everything on this page is a Work In Progress, and nothing is committed or guaranteed. The diagrams below do not yet incorporate the File or Dataset services, but will in the near future. The R3 Manifest Ingestion will deliver data loading capabilities design to meet the initial needs of loading data into OSDU while providing a framework for implementations of more robust ingestion processes.
This page captures the Scope, Definition of Done, Horizons, and Milestones for the R3 Manifest Ingestion workflow. Note that everything on this page is a Work In Progress, and nothing is committed or guaranteed. The diagrams below do not yet incorporate the File or Dataset services, but will in the near future. The R3 Manifest Ingestion will deliver data loading capabilities designed to meet the initial needs of loading data into OSDU while providing a framework for implementations of more robust ingestion processes.
The approach for R3 centers on the following concepts:
* Pre-ingestion work helps ensure well-formed data enters OSDU
* The latest Data Definition Schemas ([v1.0.0](https://community.opengroup.org/osdu/data/data-definitions/-/tree/1bdc6e43858d7f0202316135ee4b9a943a26e297)) provide robust data modeling and relationship modeling capabilities that enable programmatic enforcement
* Loading by Manifest ensures the metadata describing the underlying source data adheres to the [Well Known Structure](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation//Entity-and-Schemas/Demystifying-Well-Known-Schemas,-Well-Known-Entities,-Enrichment-Pipelines) concept, a requirement for interoperability and a [promise](https://osduforum.org/about-us/who-we-are/osdu-mission-vision/) of OSDU.
* We must first get the basic metadata into OSDU, but enable more complex workflows capable of building more robust datasets using the source data and capabilities of the platform. This approach preserves the source data while also creating and presenting consumption ready data products.
* Pre-ingestion work helps ensure well-formed data enters OSDU (meaning, work is performed outside of OSDU to create the Manifests - see below for additional details on the Manifest itself).
* The latest Data Definition Schemas ([v1.0.0](https://community.opengroup.org/osdu/data/data-definitions/-/tree/1bdc6e43858d7f0202316135ee4b9a943a26e297)) provide robust data modeling and relationship modeling capabilities that enable programmatic enforcement without requiring domain understanding. The Manifest Ingestion process does not have domain context. Subsequent ingestion workflows made possible by the Ingestion Framework support [DDMS](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Domain-&-Data-Management-Services/DDMS-&-Data-Governance) ingestion processes.
* Loading by Manifest ensures the metadata describing the underlying source data adheres to the [Well Known Structure](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation//Entity-and-Schemas/Demystifying-Well-Known-Schemas,-Well-Known-Entities,-Enrichment-Pipelines) concept, a requirement for interoperability and a [promise](https://osduforum.org/about-us/who-we-are/osdu-mission-vision/) of OSDU. While the Manifest Ingestion process focuses on loading metadata described in a Manifest, OSDU R3 allows for the registration of new schemas and the Ingestion Framework enables new ingestion workflows, which empowers others to load data in other formats and in compliance with their data management standards and processes.
* The intent of the Manifest Ingestion is to create a mechanism to load source data in its original format while enabling discovery (index, search, deliver). The Ingestion Framework enables more complex workflows capable of building more robust datasets using the source data through workflows focused on enrichment, parsing, etc. This approach preserves the source data while also creating and presenting consumption ready data products.
## R3 Manifest Ingestion Scope ##
@@ -22,12 +22,12 @@ The scope for R3 Manifest Ingestion is documented via Ingestion Uses cases found
The picture above depicts the conceptual architecture for the R3 Manifest Ingestion scope. Much of the complexity has been extracted for the sake of simplicity, but the picture hopefully illustrates the intent. We will define scope through the Definition of Done. In short, the following is considered _In-Scope_ for R3.
- Validate (Syntax and Content) and Process the contents of a [Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) into OSDU via the Storage Service. Errors and storage results are returned.
- CSV Ingestion: Present a Manifest file that references a CSV file. Process the Manifest and then trigger an Ingestion Workflow that performs additional processing of the CSV file
- Energistics Ingestion: Present a Manifest file that references an Energistics file. Process the Manifest and then trigger an Ingestion Workflow that performs additional processing of the Energistics file (e.g., WITSML)
- Validate (Syntax and Content) and Process the contents of a [Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) into OSDU via the Storage Service. Errors and storage results are retrievable. Validation should occur within services external to the workflow to allow maximum reusability.
- CSV Ingestion: Present a Manifest file that references a CSV file. Process the Manifest and then trigger an Ingestion Workflow that performs additional processing of the CSV file. To be validated with the CSV ingestion team.
- Energistics Ingestion: Present a Manifest file that references an Energistics file. Process the Manifest and then trigger an Ingestion Workflow that performs additional processing of the Energistics file (e.g., WITSML). To be validated with the Energistics team.
### Definition of Done ###
@@ -39,45 +39,67 @@ This is a high-level definition of done for the R3 Manifest Ingestion workflow.
- Confirm the calling process is authenticated and authorized to invoke the `submitWithManifest` endpoint
- Verify the Manifest Schema exists within the OSDU instance's Schema Service
- Fetch the Manifest Schema via the OSDU instance Schema Service (the process must be authenticated and authorized to perform this query and fetch)
- Prepare all inputs necessary to invoke the Manifest Ingestion workflow
- Invoke the Manifest Ingestion workflow (the Manifest Ingestion workflow is described below)
- Return a `workflowId` to the process which invoked the Ingestion Service
- Return any errors that may have occurred up until this point
- The Manifest Ingestion workflow, which runs inside of the Ingestion Framework, may perform the following activities:
- Validate the provided Manifest is syntactically correct per the indicated Manifest Schema `kind`
- This process is completed for each Reference Data, Master Data, Work Product, Work Product Component, and File element included
- Where determinable, elements provided with a valid `id` will be checked for existence in OSDU using the Storage Service. If Reference Data or Master Data already exists, an error is generated indicating data duplication
- The validation will also include searching for any supported annotation extensions, such as `x-osdu-relationship` and programmatically validating correctness where possible
- Should validation errors occur, the Ingestion Service will terminate and return those errors to the calling process
- The validation will also search for any supported annotation extensions, such as `x-osdu-relationship`, and programmatically validate correctness where possible
- Should validation errors occur, the Ingestion Service will terminate and log those errors, which will be retrievable via the `workflowId`
- Invoke the `Storage API` for each record
- This process does not support rollback. Errors that occur during this process may require manual resolution (alternatively, cleanup workflows could be established to handle these situations if the errors are pushed to the Notification Service)
- A failure of one record does not constitute the failure of all contents in the Manifest
- A failure of a parent record will prevent the Ingestion Service from processing the child records (this only applies to Work Product and Work Product Component as the Manifest does not support hierarchical relationships with Reference Data, Master Data, and Files)
- A a part of the write process, `surrogate-key`s, where specified, are resolved to the system assigned `id` created on a successful `createOrUpdateRecords` call
- Once the Manifest file is fully processed, the results of the process are returned to the calling process
- A failure of a parent record will prevent the Ingestion Service from processing the child records (this only applies to Work Product and Work Product Component as the Manifest Schema does not provide `surrogate-key` capabilities with Reference Data, Master Data, and Files)
- As a part of the write process, `surrogate-key`s, where specified, are resolved to the system-assigned `id` created on a successful `createOrUpdateRecords` call to the Storage API
- Once the Manifest file is fully processed, the results of the process are logged, which are retrievable via the `workflowId`
- There are two types of notifications that the Manifest Ingestion workflow will trigger:
- Storage Service - by design, the Storage Service will issue notifications on the completion of storing a record to trigger things like Indexing. It is possible to hook into these notifications to trigger workflows.
- Workflow Complete - once a workflow completes, a notification must be issued with the status of the workflow and the type of workflow that was executed. This will enable chaining ingestion workflows together and ensure that all data for a Manifest Ingestion workflow is successfully written before triggering additional workflows to further process the data (e.g., enrichment)
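The record-writing rules above (no rollback, independent record failures, and a failed parent blocking its children) can be sketched as follows. This is a minimal illustration, not the Ingestion Service implementation: `Record`, `WorkflowLog`, `process_manifest`, the `store` callable, and the in-memory `log_store` are all hypothetical names, and the sketch assumes parents are listed before their children.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Record:
    key: str                      # hypothetical local identifier for the record
    kind: str                     # e.g. "WorkProduct" (illustrative only)
    parent: Optional[str] = None  # key of the parent record, if any


@dataclass
class WorkflowLog:
    stored: List[str] = field(default_factory=list)
    errors: Dict[str, str] = field(default_factory=dict)


def process_manifest(
    records: List[Record],
    store: Callable[[Record], None],
    log_store: Dict[str, WorkflowLog],
    workflow_id: str,
) -> str:
    """Write each record, log the results, and return the workflow id."""
    log = WorkflowLog()
    for rec in records:  # assumption: parents appear before their children
        if rec.parent is not None and rec.parent in log.errors:
            # A failed parent prevents processing of its child records.
            log.errors[rec.key] = f"skipped: parent {rec.parent} failed"
            continue
        try:
            store(rec)  # stand-in for a Storage API call
        except Exception as exc:
            # One failed record does not fail the rest; nothing is rolled back.
            log.errors[rec.key] = str(exc)
            continue
        log.stored.append(rec.key)
    # Results are logged and retrievable later via the workflow id.
    log_store[workflow_id] = log
    return workflow_id


def rejecting_store(rec: Record) -> None:
    """Hypothetical store that rejects one record to exercise the error path."""
    if rec.key == "wp-1":
        raise ValueError("storage rejected record")


logs: Dict[str, WorkflowLog] = {}
manifest = [
    Record("ref-1", "ReferenceData"),
    Record("wp-1", "WorkProduct"),
    Record("wpc-1", "WorkProductComponent", parent="wp-1"),
]
wid = process_manifest(manifest, rejecting_store, logs, "wf-123")
print(logs[wid].stored, logs[wid].errors)
```

Keeping the results keyed by workflow id mirrors the requirement that errors and storage results are retrievable rather than returned directly to the caller.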
### Out-of-Scope ###
- The design and implementation of the Manifest Ingestion process may require updates to other OSDU services and components within the platform. Where those changes are required, we will submit the required ADRs and work through the required processes to have those items approved and implemented
- Ingestion Workflow capabilities supporting Enrichment, Extraction, Reclassification, parsers [CSV, Energistics], re-processing, etc. The Ingestion Framework supports the implementation of these pipelines, but the Manifest Ingestion team is not responsible for delivering these pipelines
- Bulk loading - another critical component, but we're starting simple. The Manifest does support some concepts of Bulk Loading, though, for R3, we may artificially limit bulk loading via the Manifest file
- Any activity involving the positioning of files or datasets into the OSDU platform - the MVE expects the completion of this step before presenting a Manifest to the Manifest Ingestion Service (i.e., loading Files or Datasets into the platform)
- Any activity involving the positioning of files or datasets into the OSDU platform - the expectation is that this step is completed before presenting a Manifest to the Manifest Ingestion Service (i.e., loading Files or Datasets into the platform). We may implement some capabilities that move a File from a temporary storage location to its permanent position as part of the Manifest Ingestion workflow
- Any activity involving the creation of the Manifest is outside the scope of R3 Manifest Ingestion
| Day 0 of R3 Manifest Ingestion. Able to submit a manifest that is prepopulated with required data, including `id`s and successfully write the data via the storage service. Basic schema validation occurs. Basic `exists` checks occur for cited data. | Able to process `surrogate-key`s. Integration with the new [Schema Service](https://community.opengroup.org/osdu/platform/system/schema-service). Provide support for Dataset Registry (if available). Provide _hook_ for initiating Ingestion Workflows via published messages from the Storage Service. Integrated testing. | Pre-Release activities. Operational readiness. Solution hardening. |
<table>
<tr>
<th width="275">Horizon 1</th>
<th width="275">Horizon 2</th>
<th width="275">Release</th>
</tr>
<tr>
<td>
[Day 0 of R3 Manifest Ingestion](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/data-prep/docs/-/blob/master/Design%20Documents/Ingestion/Core-Concept-Input_MVE-with-Ingestion-UseCases_Rev-02.pdf). Able to submit a manifest that is prepopulated with required data, including `id`s and successfully write the data via the storage service. Basic schema validation occurs. Basic `exists` checks occur for cited data.
</td>
<td>
Able to process `surrogate-key`s. Integration with the new [Schema Service](https://community.opengroup.org/osdu/platform/system/schema-service). Provide support for Dataset Registry (if available). Provide _hook_ for initiating Ingestion Workflows via published messages from the Storage Service. Integrated testing.
</td>
<td>
Pre-Release activities. Operational readiness. Solution hardening.
</td>
</tr>
</table>
This is our target for _Day 0 of R3 Manifest Ingestion_ (that is, the most basic functionality qualifying as Manifest Ingestion). Able to submit a pre-populated manifest with `id`s specified (vs. `surrogate-key`s) using the 1.0.0 version of the [Schema Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) to the Ingestion Service API endpoint.
This is our target for _[Day 0 of R3 Manifest Ingestion](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/data-prep/docs/-/blob/master/Design%20Documents/Ingestion/Core-Concept-Input_MVE-with-Ingestion-UseCases_Rev-02.pdf)_ (that is, the most basic functionality qualifying as Manifest Ingestion). Able to submit a pre-populated manifest with `id`s specified (vs. `surrogate-key`s) using the 1.0.0 version of the [Schema Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) to the Ingestion Service API endpoint.
- Schema validation for R3 schemas (Master Data, Reference Data, Work Product, Work Product Components, and File)
- Additional content validation capabilities, which include verifying that cited data exists and data relationships are correct
- Load one and only one manifest at a time (bulk loading is managed externally to the ingestion process)
- Additional content validation capabilities, which include verifying that cited data exists (where derivable via the Schema definitions)
- Load one and only one manifest at a time (bulk loading of Manifests is managed externally to the ingestion process)
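The basic `exists` check for cited data could look roughly like the sketch below. It assumes the set of fields that hold references has already been derived from the Schema definitions; `missing_references`, `reference_fields`, the field names, the ids, and the `exists` callable (a stand-in for a Storage Service lookup) are hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List


def missing_references(
    record: Dict[str, Any],
    reference_fields: Iterable[str],
    exists: Callable[[str], bool],
) -> List[str]:
    """Return the cited ids in `record` that the lookup cannot find."""
    missing: List[str] = []
    for name in reference_fields:
        cited = record.get(name)
        if cited is None:
            continue  # field not populated, nothing to check
        ids = cited if isinstance(cited, list) else [cited]
        missing.extend(i for i in ids if not exists(i))
    return missing


# Hypothetical ids and field names, for illustration only.
known_ids = {"ref:unit:m", "master:well:1"}
record = {
    "UnitID": "ref:unit:m",
    "WellIDs": ["master:well:1", "master:well:2"],
}
print(missing_references(record, ["UnitID", "WellIDs"], known_ids.__contains__))
```

Passing the lookup in as a callable keeps the check independent of how the Storage Service is actually queried.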
## Horizon 2 ##
(WIP)
Able to submit a pre-populated manifest to the Ingestion Service with support for `surrogate-key`s to enable on-write resolution of `id`s and construction of Work Product, Work Product Component, File and Dataset relationships.
- Additional content validation capabilities, which include verifying data relationships are correct per schema definitions
- Able to coordinate writes to the storage service and properly update `surrogate-key`s specified in the Manifest file to preserve relationships when `id`s are not available at manifest generation time
Able to submit a pre-populated manifest to the Ingestion Service with support for `surrogate-key`s to enable on-write resolution of `id`s and construction of Work Product, Work Product Component, File and Dataset relationships
- Additional content validation capabilities, which include verifying data relationships are correct per Schema definitions
- Able to coordinate writes to the storage service and properly update `surrogate-key`s specified in the Manifest file to preserve relationships when `id`s are not available at Manifest generation time
- Integrated with the new [Schema Service](https://community.opengroup.org/osdu/platform/system/schema-service) to fetch schemas for validation
- Ingestion Workflow "integration" with the Notification Service to trigger Ingestion Workflows
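On-write resolution of `surrogate-key`s, as described for this horizon, might be sketched as a substitution pass run over each record before it is written. This is an assumption-laden illustration: `resolve_surrogate_keys`, the `surrogate-key:` prefix, the field names, and the id format are hypothetical, and the `resolved` map is assumed to be filled in as earlier writes return system-assigned `id`s.

```python
from typing import Any, Dict


def resolve_surrogate_keys(value: Any, resolved: Dict[str, str]) -> Any:
    """Return a copy of `value` with known surrogate keys replaced by ids."""
    if isinstance(value, str):
        # Strings that are not known surrogate keys are left untouched.
        return resolved.get(value, value)
    if isinstance(value, list):
        return [resolve_surrogate_keys(v, resolved) for v in value]
    if isinstance(value, dict):
        return {k: resolve_surrogate_keys(v, resolved) for k, v in value.items()}
    return value


# Hypothetical flow: the File record was written first and received an id.
resolved_ids = {"surrogate-key:file-1": "tenant:file:0001"}
component = {
    "Name": "Example component",
    "Files": ["surrogate-key:file-1"],
}
print(resolve_surrogate_keys(component, resolved_ids))
```

Because the function returns a new structure, the original Manifest content stays unchanged, which helps when logging what was submitted versus what was written.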