This page captures the Scope, Definition of Done, Horizons, and Milestones for the R3 Manifest Ingestion workflow.
The approach for R3 centers on the following concepts:
* Pre-ingestion work helps ensure well-formed data enters OSDU (meaning, work is performed outside of OSDU to create the Manifests - see below for additional details on the Manifest itself).
* The latest Data Definition Schemas ([v1.0.0](https://community.opengroup.org/osdu/data/data-definitions/-/tree/1bdc6e43858d7f0202316135ee4b9a943a26e297)) provide robust data modeling and relationship modeling capabilities that enable programmatic enforcement without requiring domain understanding. The Manifest Ingestion process does not have domain context. Subsequent ingestion workflows made possible by the Ingestion Framework support [DDMS](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation/Domain-&-Data-Management-Services/DDMS-&-Data-Governance) ingestion processes. Note that the schemas reflected above represent the Data Definition team's work on defining the Well Known Structure (WKS) format for OSDU data types to promote and encourage interoperability. You can view the latest Data Definitions schemas on the [Data Definitions GitLab site](https://community.opengroup.org/osdu/data/data-definitions).
* Loading by Manifest using the schemas defined by the Data Definitions team ensures the metadata describing the underlying source data adheres to the [Well Known Structure](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation//Entity-and-Schemas/Demystifying-Well-Known-Schemas,-Well-Known-Entities,-Enrichment-Pipelines) concept, which supports interoperability and a [promise](https://osduforum.org/about-us/who-we-are/osdu-mission-vision/) of OSDU. While the Manifest Ingestion process focuses on loading metadata described in a Manifest, OSDU R3 allows for the registration of new schemas and the Ingestion Framework enables new ingestion workflows, which empowers others to load data in other formats and in compliance with their data management standards and processes by writing custom ingestion workflows.
* The intent of the Manifest Ingestion is to create a mechanism to load source data in its original format while enabling discovery (index, search, deliver). The Ingestion Framework enables more complex workflows capable of building more robust datasets using the source data through workflows focused on enrichment, parsing, etc. Approaching ingestion in this manner preserves the source data while also creating and presenting consumption ready data products.
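As a rough illustration of the concepts above, a Manifest is a metadata document describing the source data to be made discoverable. The sketch below renders one as a Python dict; the property names are simplified assumptions, and the authoritative shape is the linked Manifest.1.0.0.json schema.

```python
# Illustrative sketch only: property names are simplified and hypothetical.
# The authoritative structure is Manifest.1.0.0.json in the Data Definitions
# repository linked above.
manifest = {
    "kind": "osdu:wks:Manifest:1.0.0",
    "ReferenceData": [
        {
            "kind": "osdu:wks:reference-data--UnitOfMeasure:1.0.0",
            "id": "opendes:reference-data--UnitOfMeasure:m",
            "data": {"Name": "metre"},
        }
    ],
    "MasterData": [
        {
            "kind": "osdu:wks:master-data--Well:1.0.0",
            "id": "surrogate-key:well-1",  # resolved to a real id on write
            "data": {"FacilityName": "Well A"},
        }
    ],
}
```

The `surrogate-key:` placeholder is how relationships are expressed before real `id`s exist; its resolution is discussed later on this page.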
## R3 Manifest Ingestion Scope ##
...
...
The scope for R3 Manifest Ingestion is documented via the Ingestion Use Cases.
The picture above depicts the conceptual architecture for the R3 Manifest Ingestion scope. Much of the complexity has been extracted for the sake of simplicity, but the picture hopefully illustrates the intent. We will define scope through the Definition of Done. In short, the following is considered _In-Scope_ for R3. The numbers presented in the diagram will receive additional context below. Furthermore, the Manifest Ingestion workflow runs within the Ingestion Framework. Therefore, the architecture above is meant to illustrate the components the Manifest Ingestion workflow depends on and not the architecture designed for the Manifest Ingestion workflow. In other words, the Manifest Ingestion workflow is a tenant of the Ingestion Framework.
- Validate (Syntax and Content) and Process the contents of a [Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) into OSDU via the Storage Service. Validation will occur within the workflow to promote scalability given the workflows are executed asynchronously and a large manifest file could take some time to validate. The validation should be optional in that a process can elect to skip or ignore validation errors. The Manifest Ingestion is meant to represent best practices, but cannot enforce them given it is not meant as a prescriptive data management solution.
- At the completion of the Manifest Ingestion workflow, a notification must be generated indicating that the workflow is complete, allowing other workflows to initiate.
- A proposal to deprecate the Ingestion Service in favor of working directly with the Ingestion Framework.
### Definition of Done ###
This is a high-level definition of done for the R3 Manifest Ingestion workflow. A more detailed [Definition of Done](https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/wikis/Manifest-Ingestion/R3-MVE-Manifest-Ingestion/R3-MVE-Manifest-Ingestion-Definition-of-Done) is also available.
- A process must present a well-formed and correct Manifest to the Ingestion Service endpoint for processing
- The Ingestion Service must support multiple, simultaneous calls and scaling as required to meet demand
- Scaling limits TBD
- The Ingestion Service may perform the following activities at this stage:
- Confirm the calling process is authenticated and authorized to invoke the `submitWithManifest` endpoint
- Verify the Manifest Schema exists within the OSDU instance's Schema Service
- Fetch the Manifest Schema via the OSDU instance Schema Service (the process must be authenticated and authorized to perform this query and fetch)
- Prepare all inputs necessary to invoke the Manifest Ingestion workflow
- Invoke the Manifest Ingestion workflow (the Manifest Ingestion workflow is described below)
- Return a `workflowId` to the process which invoked the Ingestion Service
- Return any errors that may have occurred up until this point
- The Manifest Ingestion workflow, which runs inside of the Ingestion Framework, may perform the following activities:
- Validate the provided Manifest is syntactically correct per the indicated Manifest Schema `kind`
- This process is completed for each Reference Data, Master Data, Work Product, Work Product Component, and File element included
- Where determinable, elements provided with a valid `id` will be checked for existence in OSDU using the Storage Service. If Reference Data or Master Data already exists, an error is generated indicating data duplication
- The validation will also include searching for any supported annotation extensions, such as `x-osdu-relationship` and programmatically validate correctness where possible
- Should validation errors occur, the Manifest Ingestion workflow will terminate and log those errors, which will be retrievable via the `workflowId`
- Invoke the `Storage API` for each record
- This process does not support rollback. Errors that occur during this process may require manual resolution (alternatively, cleanup workflows could be established to handle these situations if the errors are pushed to the Notification Service)
- A failure of one record does not constitute the failure of all contents in the Manifest
- A failure of a parent record will prevent the workflow from processing the child records (this only applies to Work Product and Work Product Component as the Manifest Schema does not provide `surrogate-key` capabilities with Reference Data, Master Data, and Files)
- As a part of the write process, `surrogate-key`s, where specified, are resolved to the system assigned `id` created on a successful `createOrUpdateRecords` call to the Storage API
- Once the Manifest file is fully processed, the results of the process are logged, which are retrievable via the `workflowId`
- There are two types of notifications that the Manifest Ingestion workflow will trigger:
- Storage Service - by design, the Storage Service will issue notifications on the completion of storing a record to trigger things like Indexing. It is possible to hook into these notifications to trigger workflows.
- Workflow Complete - once a workflow completes, a notification must be issued with the status of the workflow and the type of workflow that was executed. This will enable chaining ingestion workflows together and ensure that all data for a Manifest Ingestion workflow is successfully written before triggering additional workflows to further process the data (e.g., enrichment)
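The "Workflow Complete" notification described above could carry a payload along these lines. This is a hypothetical shape for illustration only; the field names are assumptions, as the actual message contract is not defined on this page.

```python
def workflow_complete_notification(workflow_id, workflow_type, status, results):
    """Hypothetical 'workflow complete' notification payload. The page above
    only requires that the status and the workflow type be included, so every
    field name here is an assumption."""
    return {
        "workflowId": workflow_id,
        "workflowType": workflow_type,  # e.g. "manifest-ingestion"
        "status": status,               # e.g. "finished" or "failed"
        "results": results,             # per-record storage outcomes
    }
```

A trigger component subscribed to these messages could inspect `workflowType` and `status` to decide whether to start a follow-on enrichment workflow.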
In its simplest form, the Manifest Ingestion Workflow works as follows:
- A well-formed Manifest is created externally to OSDU and presented to the Ingestion Workflow (via the `startWorkflow` endpoint in the `os-workflow` module)
- The workflow will perform optional validation against the schema (syntax, structure, verifying reference data exists, determining relationship conformity where possible, etc.)
- After validation (if successful or if validation is optionally skipped), the workflow will invoke the Storage API to save the records, which will in turn trigger indexing enabling search
- The Manifest Ingestion Workflow must be Cloud Service Provider (CSP) agnostic
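The simple flow above can be sketched as a short orchestration function. This is not the actual DAG implementation, just a minimal illustration under the assumption that `validate` returns a list of error strings and `store` returns the stored record ids.

```python
def run_manifest_ingestion(manifest, validate, store, skip_validation=False):
    """Minimal sketch of the simple flow described above (not the real DAG).
    validate: manifest -> list of error strings (empty list == valid)
    store:    manifest -> list of stored record ids
    Indexing is not invoked here because the Storage Service triggers it
    itself on a successful write."""
    if not skip_validation:
        errors = validate(manifest)
        if errors:
            return {"status": "failed", "errors": errors}
    return {"status": "finished", "recordIds": store(manifest)}
```

Note that `skip_validation` mirrors the page's statement that validation is optional and a process may elect to skip it.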
There are three primary phases the Manifest Ingestion Process undergoes:
- Syntax Validation - This stage ensures the Manifest is structurally (syntactically) correct based on the referenced schemas (identified by the `kind` property). Schema Validation occurs as follows:
1. The submitted Manifest file is validated against the registered Manifest Schema within OSDU based on the submitted Manifest's `kind`
2. The Manifest has embedded data that have their own Schemas, which are also identified by their `kind`. The Schema Validation process will fetch and validate each component of the Manifest that has a specified `kind`. If any of the validations fail, errors are logged and the Manifest Ingestion workflow is terminated unless the invoking process opted to ignore errors.
- Pre-Pass - Here, the Manifest content is checked for intent. The OSDU schema definitions indicate additional information that permits additional validation without having knowledge of the data's domain. For example, cited data where an `id` is present suggests the data should exist in the destination OSDU instance. The validation will ensure that data exists. Additionally, if the data is Reference Data or Master Data and a valid `id` is presented and the data exists, the validation will fail the Manifest to avoid duplication. The `x-osdu-relationship` annotation of OSDU schemas also informs relationships, which can be validated programmatically. If the submitting process opts to ignore errors, then any validation errors encountered will be logged, but the Manifest Ingestion workflow will not be terminated.
- Process - At this stage, the data has been validated, and either no errors were discovered, or the submitting process opted to ignore the errors. Version 1.0.0 of the OSDU schemas leverages `surrogate-key`s to represent relationships between the data elements prior to persisting the data, because the `id`s are unknown when the Manifest is created. The Process stage will replace the `surrogate-key` instances with generated `id`s to ensure referential integrity is maintained. Note that this stage will create the `id`s within the workflow vs. letting the Storage service create the `id`s. This is done to reduce the complexity in determining the correct graph order to write all the presented data. The `id`s will be created as per the pattern defined in the schemas so as to be consistent with the platform. If the desire is to have the system create the `id`s, then leverage multiple Manifests and an external process to govern the order of the writes, or have the external process invoke the Storage API directly.
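The surrogate-key replacement performed in the Process stage can be sketched as a two-pass substitution. This is an illustrative implementation, not the OSDU one; in particular, the generated id pattern and the `opendes` partition prefix here are placeholder assumptions, since the real pattern is defined per schema.

```python
import uuid

def resolve_surrogate_keys(records):
    """Replace 'surrogate-key:*' placeholders with generated ids before
    writing, as the Process stage describes. Sketch only: the id pattern
    below (partition 'opendes' + entity type + uuid) is a placeholder."""
    mapping = {}
    for rec in records:                      # pass 1: assign real ids
        rid = rec.get("id", "")
        if rid.startswith("surrogate-key:"):
            entity_type = rec["kind"].split(":")[2]   # e.g. master-data--Well
            mapping[rid] = f"opendes:{entity_type}:{uuid.uuid4()}"
            rec["id"] = mapping[rid]

    def substitute(value):                   # pass 2: rewrite references
        if isinstance(value, str):
            return mapping.get(value, value)
        if isinstance(value, list):
            return [substitute(v) for v in value]
        if isinstance(value, dict):
            return {k: substitute(v) for k, v in value.items()}
        return value

    return [substitute(r) for r in records]
```

Doing the assignment in one pass and the substitution in another is what avoids the graph-ordering problem mentioned above: no parent needs to be written before a child references it.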
The intent of the validation checks is to minimize the work required to manually address data loading errors should they occur. Release 3 of OSDU does not have a rollback mechanism. The submitting process may opt to ignore errors, but doing so may result in manual efforts to resolve data issues.
As data is successfully persisted by the Storage Service, notifications are generated. A process within the Ingestion Framework may subscribe to the Notifications to initiate Ingestion Workflows that take additional action on the data. Additionally, a notification is also generated on the completion of a workflow. A Workflow Trigger will subscribe to these notifications enabling it to initiate additional workflows as configured. The trigger capability is likely to come post R3.
### How Manifest Ingestion will work ###
The following diagram illustrates the workflow and the sequence of steps, which are further described below.
NOTE: Where "Manifest Ingestion Workflow" is referenced, we are referring to the Directed Acyclic Graph (DAG) responsible for processing the Manifest data. This DAG contains one or more operators that process the data in some manner. The collection of operators makes a DAG and the DAG is what the Ingestion Workflow Service will execute.
### 1. Initiating Ingestion ###
The Manifest Ingestion process is initiated via the Ingestion Workflow Service (the module name is `os-workflow` and the API endpoint is called `startWorkflow`).
_Requirements_
- The Workflow Service should ensure the calling process is authenticated and authorized to initiate a workflow
- Workflows must be discoverable via a registry using a unique name to allow maximum flexibility and reusability
- The caller can provide a workflow payload that either contains the manifest payload or a standardized structure containing a pointer to the payload that was pre-loaded to storage (think Datasets or Files and passing references to data vs. passing the data itself)
In R2, ingestion was initiated via the Ingestion Service. Given improvements in the Ingestion Framework (`os-workflow`), it is proposed that we deprecate the Ingestion Service. As such, this approach does not recommend initiating a Manifest Ingestion Workflow via the Ingestion Service.
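A caller-side sketch of building the `startWorkflow` payload follows. The field names (`workflowName`, `executionContext`, `manifestRef`) are illustrative assumptions, not the actual `os-workflow` contract; the point is the either/or requirement above, where the payload carries the manifest inline or a pointer to pre-loaded storage.

```python
def build_start_workflow_request(workflow_name, manifest=None, manifest_ref=None):
    """Hypothetical request-body builder for the startWorkflow endpoint.
    Field names here are assumptions for illustration only. Exactly one of
    manifest (inline payload) or manifest_ref (pointer to pre-loaded data,
    e.g. a Dataset/File reference) must be supplied."""
    if (manifest is None) == (manifest_ref is None):
        raise ValueError("provide exactly one of manifest or manifest_ref")
    if manifest is not None:
        context = {"manifest": manifest}
    else:
        context = {"manifestRef": manifest_ref}
    return {"workflowName": workflow_name, "executionContext": context}
```

Passing a reference instead of the payload itself keeps large manifests out of the request body, matching the "pass references to data vs. passing the data itself" note above.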
### 2. Initiating the Workflow ###
The Workflow Service contained within the Ingestion Framework is capable of initiating a workflow. OSDU R3 leverages Apache Airflow for workflow execution. On initiating a workflow, a `workflowId` is created, which may be used to fetch the status of the workflow from the Ingestion Workflow Service (i.e., `WorkflowStatusAPI.getWorkflowStatus`). The workflow is the mechanism by which manifest data is processed into OSDU.
_Requirements_
- The Ingestion Framework should validate the named workflow exists within the workflow registry and throw an error if the named workflow is not found
- If the named workflow does exist within the registry, the Ingestion Workflow Service must initiate the workflow with the presented payload
- On initiating a workflow, a unique `workflowId` must be created and returned to the calling process
- Whatever payload was provided to the Ingestion Workflow Service must be presented to the initiated workflow for processing
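The requirements above (registry lookup, error on unknown name, unique `workflowId`, payload hand-off) can be captured in a toy registry. This is a simplified stand-in for the Ingestion Workflow Service, not its real design.

```python
import uuid

class WorkflowRegistry:
    """Toy sketch of the registry behavior required above; the real
    Ingestion Workflow Service is more involved."""

    def __init__(self):
        self._workflows = {}

    def register(self, name, dag):
        """dag: callable(workflow_id, payload) executing the workflow."""
        self._workflows[name] = dag

    def start(self, name, payload):
        if name not in self._workflows:      # named workflow must exist
            raise KeyError(f"unknown workflow: {name}")
        workflow_id = str(uuid.uuid4())      # unique id returned to caller
        self._workflows[name](workflow_id, payload)  # payload handed to DAG
        return workflow_id
```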
### 3. Executing the ingestion workflow ###
The Manifest Ingestion workflow is a default ingestion workflow capable of processing version 1.0.0 of the Manifest Schema, which may contain Master Data, Reference Data, Work Product, Work Product Components, or Files (note here that "File" is a superset of data sources that is capable of representing Datasets - see the [Dataset as Core Service ADR](https://community.opengroup.org/osdu/platform/system/home/-/issues/65#register-pane) for more information).
The intent of the Manifest Ingestion Workflow is to provide out-of-the-box capability within OSDU to store and define source data in its original format while making that data searchable and discoverable via the metadata provided in the manifest. In other words, load source data (e.g., a dataset) to OSDU. Create a manifest file. Leverage the Manifest Ingestion Workflow to submit the Manifest to provide metadata about the source data to OSDU. That metadata is validated and then stored, which triggers indexing processes that make the source data discoverable via defined metadata in the manifest.
During this step, the validation process outlined above occurs. There are multiple validation requirements, which are defined in the table below.
_Requirements_
- Obtain the manifest payload, either through the payload presented to the workflow or via accessing the referenced Dataset (see Step #1 for more details)
- Validate the manifest payload per the validation rules listed below
- Any validation errors must be logged and will result in the termination of the workflow
- Schema validation occurs by fetching from the Schema Service the schema for each `kind` presented in the manifest payload
| Type | Validation rule | Description |
| ---- | --------------- | ----------- |
| Syntax | Syntax check | Ensure the full schema and all referenced schemas adhere to the defined schemas registered with the Schema Service based on `kind`. This includes structure, syntax, mandatory fields, unknown attributes, and attribute pattern adherence. |
| Pre-Pass | Cited Data Exists | Ensure referenced items that have the reference `id` property populated exist within OSDU |
| Pre-Pass | Duplication | Ensure presented Master and Reference data that has the `id` field populated does not already exist. This is to prevent duplication. |
| Pre-Pass | Valid Hierarchy | Leveraging the `x-osdu-relationship` definition within the OSDU schemas, ensure presented data that is hierarchical in nature adheres to the relationships defined within the referenced schema. |
| Pre-Pass | Surrogate Keys | Ensure the use of `surrogate-keys` is consistent and accurate. Ensure all `surrogate-key` references to a parent entity are resolved within the manifest (i.e., no orphaned `surrogate-keys`). |
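Two of the Pre-Pass rules in the table (Duplication and Surrogate Keys) are sketched below in simplified form. The `refs` field is a hypothetical stand-in for the schema-derived relationship properties, and `exists_in_osdu` stands in for a Storage Service lookup.

```python
def prepass_validate(manifest_records, exists_in_osdu):
    """Simplified sketch of the Duplication and Surrogate Keys rules above.
    manifest_records: list of dicts with 'id' and an illustrative 'refs'
    list of referenced ids. exists_in_osdu: callable(id) -> bool, standing
    in for a Storage Service existence check."""
    errors = []
    declared = {r["id"] for r in manifest_records if "id" in r}

    for rec in manifest_records:
        rid = rec.get("id", "")
        if rid.startswith("surrogate-key:"):
            continue                         # not yet persisted; nothing to check
        if rid and exists_in_osdu(rid):
            errors.append(f"duplicate: {rid} already exists")

    for rec in manifest_records:             # orphaned surrogate-key references
        for ref in rec.get("refs", []):
            if ref.startswith("surrogate-key:") and ref not in declared:
                errors.append(f"orphaned surrogate key: {ref}")
    return errors
```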
### 4. Supporting Services ###
This step is shown for illustrative purposes: as part of Schema Validation, the Manifest Ingestion Workflow will fetch Schemas from the Schema Service for the `kind`s identified in the manifest payload.
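Gathering every `kind` to fetch can be done with a straightforward recursive walk of the manifest payload, as sketched here:

```python
def collect_kinds(node, kinds=None):
    """Walk a manifest payload (nested dicts/lists) and gather every 'kind'
    value so the corresponding schemas can be fetched from the Schema
    Service. Illustrative sketch, not the workflow's actual code."""
    if kinds is None:
        kinds = set()
    if isinstance(node, dict):
        if "kind" in node:
            kinds.add(node["kind"])
        for value in node.values():
            collect_kinds(value, kinds)
    elif isinstance(node, list):
        for item in node:
            collect_kinds(item, kinds)
    return kinds
```

Deduplicating into a set matters in practice: a manifest with hundreds of records usually references only a handful of distinct `kind`s, so each schema is fetched once.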
### 5. Storage Service ###
The Manifest Ingestion Workflow will invoke the `RecordAPI.createOrUpdateRecords` API endpoint within the Storage service for data that passes validation.
_Requirements_
- The process may opt to present all records to the Storage Service at once, one at a time, or in smaller batches.
- Note that the Storage Service does not support transactions, so rollbacks are not possible. This is the reason for the upfront validation checks to help reduce the manual work required in backing out partially stored manifests.
- A failure of one record will not constitute the failure of all contents in the Manifest.
- A failure of a parent record will prevent the workflow from processing the child records.
- Responses from the Storage API should be logged where appropriate.
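The batching and failure rules above can be sketched as follows. This is illustrative, not the actual implementation: records are assumed ordered parents-before-children, and the `parent` field is a simplified stand-in for the real parent relationship.

```python
def write_to_storage(records, create_or_update, batch_size=500):
    """Sketch of the write step (not the actual implementation). Assumes
    records are ordered parents-before-children and rec['parent'] names the
    parent record id. No rollback exists, so failures are only collected
    for later (possibly manual) resolution, per the note above."""
    failed, stored = set(), []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        # skip children whose parent already failed
        writable = [r for r in batch if r.get("parent") not in failed]
        failed.update(r["id"] for r in batch if r.get("parent") in failed)
        try:
            stored += create_or_update(writable)  # returns stored record ids
        except Exception:
            failed.update(r["id"] for r in writable)  # whole call failed
    return stored, failed
```

With `batch_size=1` this degrades to per-record writes, which most directly matches the rule that one record's failure does not fail the whole Manifest.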
### 6. & 7. - Notification Service ###
There are two instances in which the Notification Service is invoked via processes performed by the manifest ingestion workflow.
1. When a record is successfully created via the Storage endpoint, notifications are generated that trigger other services (e.g., Indexing). A downstream process could register for this notification and then take action. For example, loading a Seismic Trace with an underlying SegY file might trigger a downstream enrichment workflow that creates an OpenVDS artefact for the SegY file, so the Seismic Trace is represented both by its original SegY file and by the OpenVDS artefact.
2. When the manifest ingestion workflow completes, it will send a message to the notification service indicating it has completed processing a manifest. This will enable downstream processes to kick-off if additional processing is needed (think enrichment, quality checks, etc.). This feature may not be fully supported in the OSDU 3.0 (R3), but may come as a minor feature later (i.e., 3.x).
### 8. Event-Driven Workflows ###
This feature may not be supported in the OSDU 3.0 (R3), but may come as a minor feature later (i.e., 3.x).
While this may be out of scope for R3, the Notification Service may be used to indicate the status of ingestion activities. These statuses could indicate to registered subscribers that additional workflows should be initiated. The intent is to allow a chain of workflows to initiate to process incoming records without tightly coupling those workflows together. This is an event-driven architecture design.
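The decoupling described above can be illustrated with a minimal in-memory publish/subscribe sketch. This is a toy stand-in for the Notification Service, assuming (hypothetically) that events carry a `workflowType` field to route on.

```python
class NotificationBus:
    """Minimal in-memory stand-in for the Notification Service, showing how
    completion events could chain workflows without coupling them. The
    'workflowType' routing field is an assumption for illustration."""

    def __init__(self):
        self._subscribers = {}

    def subscribe(self, workflow_type, handler):
        self._subscribers.setdefault(workflow_type, []).append(handler)

    def publish(self, event):
        for handler in self._subscribers.get(event["workflowType"], []):
            handler(event)  # e.g. a trigger that starts an enrichment DAG
```

Because the manifest ingestion workflow only publishes and the enrichment trigger only subscribes, either side can change (or be added post R3) without modifying the other.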
### 9. Post-Manifest Workflows ###
As mentioned above, the Ingestion Framework will support multiple ingestion workflows. These additional workflows will take on tasks such as enrichment, quality checking, insight extraction, analytics, and more. It's important that the framework allow OSDU consumers the ability to author their own workflows and specify how those workflows are initiated (manually, event-driven, other workflows, etc.).
### Out-of-Scope ###
- The design and implementation of the Manifest Ingestion process may require updates to other OSDU services and components within the platform. Where those changes are required, we will submit the required ADRs and work through the required processes to have those items approved and implemented by the identified teams who own the delivery of those services.
- Ingestion Workflow capabilities supporting Enrichment, Extraction, Reclassification, parsers [CSV, Energistics], re-processing, etc. The Ingestion Framework supports the implementation of these pipelines, but the Manifest Ingestion team is not responsible for delivering these pipelines in R3.
- Bulk loading - another critical component, but we're starting simple. The Manifest does support some concepts of Bulk Loading, though, for R3, we may artificially limit bulk loading via the Manifest file by imposing limits on collection sizes within the Manifest.
- Any activity involving the positioning of files or datasets into the OSDU platform - the expectation is that the completion of this step occurs before presenting a Manifest to the Manifest Ingestion Workflow (i.e., loading Files or Datasets into the platform).
- Any activity involving the creation of the Manifest is outside the scope of R3 Manifest Ingestion.
## Potential Roadmap for Manifest Ingestion ##
...
...
## Horizon 1 ##
(WIP)
Due Milestone 3.
This is our target for _[Day 0 of R3 Manifest Ingestion](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/data-prep/docs/-/blob/master/Design%20Documents/Ingestion/Core-Concept-Input_MVE-with-Ingestion-UseCases_Rev-02.pdf)_ (that is, the most basic functionality qualifying as Manifest Ingestion). Able to submit a pre-populated manifest with `id`s specified (vs. `surrogate-key`s) using the 1.0.0 version of the [Manifest Schema](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) to the Ingestion Service API endpoint.
- Schema validation for R3 schemas (Master Data, Reference Data, Work Product, Work Product Components, and File)
- Additional content validation capabilities, which includes verifying that cited data exists (where derivable via the Schema definitions)
...
...
## Horizon 2 ##
(WIP)
Due Milestone 4.
Able to submit a pre-populated manifest to the Ingestion Service with support for `surrogate-key`s to enable on-write resolution of `id`s and construction of Work Product, Work Product Component, File and Dataset relationships
- Additional content validation capabilities, which includes verifying data relationships are correct per Schema definitions
- Able to coordinate writes to the storage service and properly update `surrogate-key`s specified in the Manifest file to preserve relationships when `id`s are not available at Manifest generation time