## Overview ##
This page captures the Scope, Definition of Done, Design, Development Horizons, and Milestones for the R3 Manifest Ingestion workflow. Note that everything on this page is a Work In Progress, and nothing is committed or guaranteed. The diagrams below do not yet incorporate the File or Dataset services, but will in the near future. The R3 Manifest Ingestion will deliver data loading capabilities designed to meet the initial needs of loading data into OSDU while providing a framework for implementations of more robust ingestion processes.
The approach for R3 centers on the following concepts:
* Pre-ingestion work helps ensure well-formed data enters OSDU (meaning, work is performed outside of OSDU to create the Manifests - see below for additional details on the Manifest itself).
* The latest [Data Definition Schemas](https://community.opengroup.org/osdu/data/data-definitions) provide robust data modeling and relationship modeling capabilities that enable programmatic enforcement without requiring domain understanding. The Manifest Ingestion process lacks an understanding of the data itself, which is by design; DDMS concepts address the domain-specific needs of data. Note that the schemas referenced above represent the Data Definition team's effort in defining the Well Known Structure (WKS) format for OSDU data types to promote and encourage interoperability.
* Loading by Manifest using the schemas defined by the Data Definitions team ensures the metadata describing the underlying source data adheres to the [Well Known Structure](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation//Entity-and-Schemas/Demystifying-Well-Known-Schemas,-Well-Known-Entities,-Enrichment-Pipelines) concept, which supports interoperability and a [promise](https://osduforum.org/about-us/who-we-are/osdu-mission-vision/) of OSDU. While the Manifest Ingestion process focuses on loading metadata described in a Manifest, OSDU R3 allows for the registration of new schemas and the OSDU Workflow Service enables new ingestion workflows, which empowers others to build ingestion pipelines in compliance with their data management standards and processes.
* The intent of the Manifest Ingestion is to create a mechanism to load source data in its original format while enabling discovery (index, search, deliver). A critical component of establishing data lineage is to bring data in as-is and build new representations of that data for additional needs.
Still need:
- Info on E&O
- Testing
- Reporting
- Admin UI
## R3 Manifest Ingestion Scope ##
The scope for R3 Manifest Ingestion is documented via the Ingestion Use Cases found [here](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/data-prep/docs/-/blob/master/Design%20Documents/Ingestion/Core-Concept-Input_MVE-with-Ingestion-UseCases_Rev-02.pdf). For more details, see the _Definition of Done_ section below.
### Definition of Done ###
![R3_Ingestion_Workflows-Ingestion_Service_Workflow](https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/wikis/uploads/e97c2b1574f4b47aeeaa2c87e5cd936b/R3_Ingestion_Workflows-Ingestion_Service_Workflow.png)
The picture above depicts the conceptual architecture for the R3 Manifest Ingestion scope. Much of the complexity has been removed for the sake of simplicity, but the picture illustrates the intent. We define scope through the Definition of Done; in short, the following is considered _In-Scope_ for R3. The numbers presented in the diagram receive additional context below. Note that the Manifest Ingestion workflow runs within the Ingestion Framework, so the architecture above illustrates the components the Manifest Ingestion workflow depends on rather than an architecture designed solely for the Manifest Ingestion workflow. In other words, the Manifest Ingestion workflow is a tenant of the Ingestion Framework.
- Validate (Syntax and Content) and Process the contents of a [Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) into OSDU via the Storage Service. Validation will occur within the workflow to promote scalability, given that workflows execute asynchronously and a large manifest file could take some time to validate. Some of the validation might be optional in that a process can elect to skip or ignore a validation step through recomposition of the workflow's DAG operators. The intent is to preserve the integrity of the platform without prescribing data management practices.
- At the completion of the Manifest Ingestion workflow, a notification must be generated indicating that the workflow is complete, allowing other workflows to initiate. This capability will come post-R3.
In its simplest form, the Manifest Ingestion Workflow works as follows:
- A well-formed Manifest is created externally to OSDU and presented to the Ingestion Service (`os-service` module). The Ingestion Service serves as a common entry point that offers structure to forming an ingestion request.
- The Ingestion Service will receive an update to support R3 schema definitions.
- This [ADR](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-service/-/issues/30) currently depicts the approved changes.
- This [OpenAPI Spec](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-service/-/blob/refactoring_ingest/docs/api/openapi.ingestion.yaml) captures the API endpoints.
- If the request is properly structured (i.e., the API was correctly invoked with the required parameters), then the Ingestion Service will leverage the Workflow Service (`os-workflow`) to initiate the appropriate workflow for the manifest ingestion process.
- The workflow first ensures the manifest is syntactically correct. Then it ensures the content (or intent) is correct. Then the workflow persists the metadata provided in the manifest. Any validation errors should result in the termination of the workflow. These steps are covered in much more detail below.
- The Manifest Ingestion Workflow must be Cloud Service Provider (CSP) agnostic, capable of running within any certified OSDU R3 environment.
### The Manifest Ingestion Process ###
The Manifest Ingestion Workflow is implemented as a DAG (Directed Acyclic Graph) that executes within an [Apache Airflow](https://airflow.apache.org/) environment. The Manifest Ingestion DAG is composed of DAG operators, individual steps that each perform some function. These DAG operators are designed to be Airflow and cloud-platform agnostic, allowing them to run within any properly configured Python 3.6 environment. In the diagram below, each "circle" represents a DAG Operator.
The following diagram illustrates the workflow and the sequence of steps, which are further described below.
NOTE: Where "Manifest Ingestion Workflow" is referenced, we are referring to the DAG.
![R3_Ingestion_Workflows-Ingestion_Service_Workflow_-_Simple](https://community.opengroup.org/groups/osdu/platform/data-flow/ingestion/-/wikis/uploads/e681ce98b53834e13598c85a068bb048/R3_Ingestion_Workflows-Ingestion_Service_Workflow_-_Simple.png)
Each DAG operator performs a specific function as part of the validation and storage processes.
- Syntax Validation - This stage ensures the Manifest is structurally (syntactically) correct based on the referenced schemas (identified by the `kind` property). Schema Validation occurs as follows:
1. The submitted Manifest file is validated against the registered Manifest Schema within OSDU based on the submitted Manifest's `kind`.
2. The Manifest Schema Definition has elements for Master Data, Reference Data, Work Product, Work Product Components, and Datasets. For each item provided within these elements, the validation will fetch the schema definition for the element's provided `kind` and validate the provided data against it.
- Cited Data Check (Allow or Reject) - The metadata provided may reference other data. Where it does, the validation will check that the referenced data already exists within OSDU. Two versions of this operator are provided: one that rejects manifest entries referencing data that does not exist, and one that allows such entries to proceed in the workflow. The "Reject" version is the default; however, to avoid prescribing data management policy, the "Allow" version permits reconfiguring the workflow to let through data that references data that does not exist.
- Dataset checks - Validate that properties required for processing datasets into OSDU are present and accurate. More detail is coming for this operator as we spec it out.
- Surrogate-key resolution - Version 1.0.0 of the OSDU schemas leverage `surrogate-key`s to allow relationships to be established before `id`s have been generated. This operator will ensure all `surrogate-key`s for datasets are represented within the Work Product Component dataset collection (e.g., `data.Datasets` property).
- Process - At this stage, the data has been validated, and either no errors were discovered, or the submitting process opted to ignore the errors. The process step orchestrates writing data to the Storage Service, resolving the `surrogate-key`s to `id`s.
NOTE: Where "Manifest Ingestion Workflow" is referenced, we are referring to the Directed Acyclic Graph (DAG) responsible for processing the Manifest data. This DAG contains one or more operators that process the data in some manner. The collection of operators makes a DAG and the DAG is what the Ingestion Workflow Service will execute.
The intent of the validation checks is to minimize the work required to manually address data loading errors should they occur. Release 3 of OSDU does not have a rollback mechanism.
### Open Question ###
1. How does a request to re-process a manifest change the behavior of the ingestion? Theoretically, the validation might be different in that objects with an `id` value specified must already exist within the platform. In an initial processing request, specified `id` values assume that an external process has determined the `id` of the entity vs. letting the Storage service make that determination.
The Manifest Ingestion process is initiated via the Ingestion Service (the `os-service` module).
_Requirements_
- The Ingestion Service should ensure the calling process is authenticated and authorized to perform ingestion.
- The caller can provide a workflow payload that either contains the manifest payload or a standardized structure containing a pointer to the payload that was pre-loaded to storage (think Datasets or Files and passing references to data vs. passing the data itself). Initial R3 will require the manifest as a payload.
- The Ingestion Service should validate the request is properly structured.
- Any errors should produce an exception that is thrown or logged, and the process should terminate. For R3, we will not support robust batch processing that allows for partial failures. We do not have a good rollback mechanism, so we must protect the integrity of the platform by detecting errors early and reducing the manual work required to resolve partial writes.
- If security and initial validation checks pass, the Ingestion Service should invoke the default workflow (by name) via the Ingestion Workflow Service.
- A successfully initiated workflow via the Ingestion Workflow Service will produce a `WorkflowRunID`, which will be returned to the Ingestion Service caller to enable workflow status queries. A sketch of this request contract follows this list.
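The following is a hedged sketch of a caller submitting a manifest by value. The route, header names, and response field are assumptions for illustration; the OpenAPI Spec linked above is authoritative.

```python
# Hypothetical client call: the route, headers, and payload shape are
# placeholders -- consult the linked OpenAPI spec for the real contract.
import requests

INGESTION_URL = "https://osdu.example.com/api/ingestion"  # placeholder base URL


def submit_manifest(manifest: dict, token: str, partition: str) -> str:
    """Submit a manifest by value and return the WorkflowRunID for status polling."""
    response = requests.post(
        f"{INGESTION_URL}/submitWithManifest",  # placeholder route
        json={"Payload": manifest},
        headers={
            "Authorization": f"Bearer {token}",
            "data-partition-id": partition,
        },
    )
    response.raise_for_status()  # malformed or unauthorized requests fail here
    return response.json()["WorkflowRunID"]
```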
### 2. Initiating the Workflow ###
The Workflow Service initiates named workflows. OSDU R3 leverages Apache Airflow for workflow execution. On initiating a workflow, a `WorkflowRunID` is created, which may be used to fetch the status of the workflow from the Workflow Service (note there is an open [ADR](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/issues/71) with this [OpenAPI Spec](https://community.opengroup.org/osdu/platform/data-flow/ingestion/ingestion-workflow/-/blob/refactoring_workflow/docs/api/openapi.workflow.yaml) that will change the APIs for Workflow Service).
_Requirements_
- Workflows must be discoverable via a registry using a unique name to allow maximum flexibility and reusability.
- The Workflow Service should validate the named workflow exists within the workflow registry and throw an error if the named workflow is not found.
- If the named workflow does exist within the registry, the Workflow Service must initiate the workflow with the presented payload.
- On initiating a workflow, a unique `WorkflowRunID` must be created and returned to the calling process.
- The payload provided to the Ingestion Workflow Service must be presented to the initiated workflow for processing. A sketch of this initiation contract follows this list.
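The registry-lookup-then-initiate contract could look like the sketch below. The registry contents, the workflow name, and the trigger mechanism are illustrative assumptions.

```python
# Sketch of the initiation contract above; the registry contents, workflow
# name, and trigger mechanism are all illustrative.
import uuid

WORKFLOW_REGISTRY = {"Osdu_ingest": "manifest_ingestion"}  # name -> DAG id (example)


class WorkflowNotFoundError(Exception):
    """Raised when the named workflow is absent from the registry."""


def trigger_dag_run(dag_id: str, run_id: str, conf: dict) -> None:
    """Placeholder for starting the DAG run, e.g., via Airflow's REST API."""


def start_workflow(workflow_name: str, payload: dict) -> str:
    """Validate the named workflow exists, initiate it, and return a WorkflowRunID."""
    dag_id = WORKFLOW_REGISTRY.get(workflow_name)
    if dag_id is None:
        raise WorkflowNotFoundError(workflow_name)
    run_id = str(uuid.uuid4())                     # the unique WorkflowRunID
    trigger_dag_run(dag_id, run_id, conf=payload)  # payload handed to the DAG
    return run_id
```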
The intent of the Manifest Ingestion Workflow is to provide out-of-the-box capability within OSDU to store and define source data in its original format while making that data searchable and discoverable via the metadata provided in the manifest.
Steps 3 - 5 occur within a DAG, and each step is implemented by one or more DAG operators.
The Manifest Ingestion workflow is a default ingestion workflow capable of processing version 1.0.0 of the Manifest Schema, which may contain Master Data, Reference Data, Work Product, Work Product Components, or Datasets (see the [Dataset as Core Service ADR](https://community.opengroup.org/osdu/platform/system/home/-/issues/65#register-pane) for more information).
Payload Resolution remains a placeholder for a potential operator that could resolve the manifest using a pointer to a dataset that is itself the manifest. In this scenario, a process created the manifest, stored it via the dataset service (in the delivery service), and obtained a dataset id. The dataset id is then presented to the Ingestion Service as a reference to the manifest to be ingested vs. passing the full manifest itself. This would allow data to be passed by reference rather than by value.
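If this operator materializes, the resolution logic might look like the speculative sketch below, where `get_dataset_content` stands in for the dataset/delivery retrieval call.

```python
# Speculative sketch of payload resolution; get_dataset_content is a
# placeholder for retrieving the stored manifest via the dataset service.
import json


def get_dataset_content(dataset_id: str) -> str:
    """Placeholder: fetch the manifest file registered under dataset_id."""
    raise NotImplementedError


def resolve_payload(request: dict) -> dict:
    if "manifest" in request:                      # passed by value (R3 default)
        return request["manifest"]
    return json.loads(get_dataset_content(request["datasetId"]))  # by reference
```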
### 3. Syntax Checking ###
The first validation to occur is Syntax Checking. This validation leverages schema definitions to ensure the manifest content adheres to the schemas of both the manifest itself and the data it contains. There are two primary syntax checks:
1. The manifest itself is validated against the manifest schema definition.
2. The contents of the manifest reference schemas using the `kind` property. Because the manifest contains Master Data, Reference Data, Work Product, Work Product Components, and data containers (identified by the `Dataset` property), there will be additional schemas referenced that must also be verified against their schema definition.
_Requirements_
- Validate the manifest against its schema definition by fetching the schema from the Schema Service using the manifest's `kind` property.
- Validate the manifest payload per the validation rules listed below.
- Any validation errors must be logged and will result in the termination of the workflow.
- Schema validation occurs by fetching from the Schema Service the schema for each Master, Reference, Work Product, Work Product Component, and Dataset element presented in the manifest payload.
| Validation Rule | Description | Required? |
| --------------- | ----------- | --------- |
| Manifest Syntax check | Fetch the schema definition from the Schema Service for the `kind` property of the manifest. Validate that the entire manifest is correct according to its schema definition. If the `kind` does not map to a Schema Definition, throw/log an error and terminate the workflow. | Yes |
| Content Syntax Validation | For each Master, Reference, Work Product, Work Product Component, and Dataset presented, leverage the `kind` property to fetch the schema definition for that element and then validate that element against its registered schema definition. If the Schema Service is unable to find a Schema Definition for a given `kind` throw/log an error and terminate the workflow. | Yes |
| ~~Unknown attributes~~ | ~~The validation should ensure only those attributes definition with the schema's definition are present. A schema may have a `data.ExtensionProperties` property, which is where undefined attributes should go. However, the schema validation by nature will pass this section if unknown attributes exist because the schema definition allows for it. Other, unknown attributes outside of this section must generate a validation error and terminate the workflow.~~ | ~~Yes~~ |
| Valid Hierarchy | The schema definitions specify required formats for properties that reference other entities. As such, the syntax check also ensures that references point to entities per the model in the schema definition. Take the [WellLog.1.0.0.json](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Generated/work-product-component/WellLog.1.0.0.json) schema definition as an example. A Well Log has an optional `resourceHomeRegionID` property. If the property is specified, then it must follow the format `^[\\w\\-\\.]+:reference-data\\/OSDURegion:.+:[0-9]*$`, which means any value referencing a type other than an OSDURegion will fail. We can therefore perform the validation check without requiring knowledge of the data's domain. | Yes |
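A minimal sketch of the two-level syntax check follows, assuming a `fetch_schema` helper that wraps the Schema Service and approximating the manifest's group property names. The `jsonschema` library enforces types, required properties, and the `pattern` formats that drive the Valid Hierarchy rule.

```python
# Sketch only: fetch_schema is a placeholder for the Schema Service call, and
# the group property names are approximations of the Manifest schema.
import jsonschema


def fetch_schema(kind: str) -> dict:
    """Placeholder: resolve `kind` to its registered schema definition."""
    raise NotImplementedError


def validate_manifest(manifest: dict) -> None:
    # Rule 1: validate the manifest itself against the schema named by its kind.
    jsonschema.validate(manifest, fetch_schema(manifest["kind"]))
    # Rule 2: validate each Master/Reference/Work Product/WPC/Dataset element
    # against the schema named by its own kind; pattern checks (Valid
    # Hierarchy) are enforced here as part of ordinary schema validation.
    data = manifest.get("Data", {})
    for group in ("MasterData", "ReferenceData", "Datasets",
                  "WorkProduct", "WorkProductComponents"):
        entries = data.get(group, [])
        for entity in (entries if isinstance(entries, list) else [entries]):
            jsonschema.validate(entity, fetch_schema(entity["kind"]))
```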
### 4. Content (Intent) Validations ###
The Content (Intent) validations review the contents to determine if the intent (that is, the ask) is correct and logical. This validation is performed across multiple steps (Cited Data checks, Dataset checks, and Surrogate Key resolution). Because the manifest ingestion does not have knowledge of data domains, it must continue to rely on validation driven by the contents of the schema definition. The OSDU R3 schema definitions have extra properties that help provide the context for this validation.
Given the schema definitions are used throughout these validation checks and the syntax check, effort should be made to avoid fetching the schemas a second time.
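One way to honor this, assuming schemas are fetched through a single helper, is simple memoization:

```python
# Memoize Schema Service lookups so each kind is fetched at most once per
# worker; fetch_schema_from_service is a placeholder for the real HTTP call.
from functools import lru_cache


def fetch_schema_from_service(kind: str) -> dict:
    """Placeholder: GET the schema definition for `kind` from the Schema Service."""
    raise NotImplementedError


@lru_cache(maxsize=None)
def get_schema(kind: str) -> dict:
    # Shared by the syntax check and the content (intent) checks.
    return fetch_schema_from_service(kind)
```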
_Requirements_
- Validate the manifest payload per the validation rules listed below.
- Any validation errors must be logged and will result in the termination of the workflow.
- Schema validation occurs by fetching from the Schema Service the schema for each Master, Reference, Work Product, Work Product Component, and Dataset element presented in the manifest payload.
_Design Consideration_
- Put each discrete validation check into its own DAG operator if the validation rule is considered optional. This will allow platform owners a mechanism of recomposing a DAG to exclude those validation rules they wish to skip.
Some of the rules below rely on the use of the OSDU schema definition extension property `x-osdu-relationship` to perform the validation. Here is an example of how this process might work:
1. Traverse the contents of the manifest's `Data.WorkProductComponents` section, which is likely to contain more than one element.
2. Use the `kind` of the element being validated to fetch the schema definition from the Schema Service (or via cache).
3. Traverse the schema definition and seek those property definitions that contain an `x-osdu-relationship` definition.
4. Capture the property name. For example, the [WellLog.1.0.0.json](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Generated/work-product-component/WellLog.1.0.0.json) schema has a property defined called `resourceHostRegionIDs` that has within its `items` object an `x-osdu-pattern` definition. We now know that the `resourceHostRegionIDs` property references another OSDU entity. This qualifies this property for additional validation checks. See below for specific validation steps to perform for this situation.
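Steps 2 - 4 might be implemented as a recursive walk over the fetched schema definition, as in this sketch (real schemas also require following `$ref` pointers, which is omitted here for brevity):

```python
# Sketch of steps 2 - 4: collect the property paths whose definitions carry an
# x-osdu-relationship annotation. $ref resolution is omitted for brevity.
def find_relationship_properties(schema: dict, path: str = "") -> list:
    hits = []
    for name, prop in schema.get("properties", {}).items():
        prop_path = f"{path}.{name}" if path else name
        # Array properties (e.g., resourceHostRegionIDs) annotate their
        # `items` object rather than the property itself.
        candidates = [prop, prop.get("items", {})]
        if any("x-osdu-relationship" in c for c in candidates):
            hits.append(prop_path)
        for child in candidates:
            hits.extend(find_relationship_properties(child, prop_path))
    return hits
```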
| Validation Rule | Description | Required? |
| --------------- | ----------- | --------- |
| ~~Duplication~~ | ~~This requires more research. I believe the intent is to validate that MasterData and ReferenceData provided with a pre-set `id` property do not already have an entry in OSDU with the same `id`. If it does, then the process is trying to load duplicate data and it should be rejected. Error is thrown/logged and workflow terminated.~~ | ~~Yes~~ |
| Dataset checks | Ensure that each `data.Dataset` entry within each Work Product Component has a corresponding Dataset entry within the manifest. | Yes |
| Surrogate Keys | Ensure the use of `surrogate-keys` is consistent and accurate. Ensure all `surrogate-key` references to a parent entity are resolved within the manifest (i.e., no orphaned `surrogate-keys`). The validation process requires identifying within a schema definition the use of the `x-osdu-relationship` extension property and then checking the manifest's value for that property to see if it has the `surrogate-key` pattern (e.g., `^surrogate-key:.+`). If it does, then an entity must exist within the manifest payload that has an `id` property with a matching `surrogate-key` value. If not, then an invalid reference exists. Throw/log an error and terminate the workflow. | No |
| Cited Data Exists | If a property is found within a `kind`'s schema definition to contain an `x-osdu-relationship` definition, and the value of the property within the manifest payload does not have a `surrogate-key` pattern, then fetch the value and leverage the Storage API to determine if the referenced data exists. If it does exist, the validation passes. If not, the validation fails, an error is thrown/logged, and the workflow is terminated. This rule is applicable to references to Reference Data and Master Data. | No |
Note: We still need to determine how best to handle data that we are able to identify as orphaned.
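The Cited Data Exists rule could be sketched as follows, with `record_exists` standing in for the Storage API lookup and the ignore-errors behavior of the "Allow" operator variant shown as a flag:

```python
# Sketch of the Cited Data Exists rule; record_exists is a placeholder for a
# Storage API lookup (version-tail handling on reference values is omitted).
import re

SURROGATE_KEY = re.compile(r"^surrogate-key:.+")


def record_exists(record_id: str) -> bool:
    """Placeholder: e.g., Storage RecordAPI record retrieval by id."""
    raise NotImplementedError


def check_cited_value(value: str, ignore_errors: bool = False) -> None:
    if SURROGATE_KEY.match(value):
        return  # resolved within the manifest; covered by the Surrogate Keys rule
    if not record_exists(value):
        if ignore_errors:
            return  # the 'Allow' operator variant would log and continue
        raise ValueError(f"Cited record does not exist: {value}")
```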
### 5. Process ###
By the time we reach the Process stage, we've done our best to ensure that the data will be written successfully. The Process stage will iterate through the manifest and write the data in the correct order while also handling the `surrogate-key` resolution.
_Requirements_
- Resolve the `surrogate-key` values either by using workflow-generated `id`s that conform to the pattern in the schema definitions or by letting the Storage Service assign the `id`s. A resolution sketch follows this list.
- Log/throw errors and terminate the workflow.
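Here is a sketch of in-workflow resolution, assuming a `generate_id` helper whose output format is illustrative; the real `id` pattern comes from the schema definitions.

```python
# Sketch of in-workflow surrogate-key resolution; the id format produced by
# generate_id is illustrative -- the real pattern comes from the schemas.
import uuid


def generate_id(kind: str, partition: str) -> str:
    # e.g., "osdu:wks:work-product-component--WellLog:1.0.0" -> entity type
    entity_type = kind.split(":")[2]
    return f"{partition}:{entity_type}:{uuid.uuid4()}"


def resolve_surrogate_keys(entities: list, partition: str) -> list:
    # Pass 1: mint a real id for every entity declared with a surrogate-key id.
    mapping = {e["id"]: generate_id(e["kind"], partition)
               for e in entities
               if str(e.get("id", "")).startswith("surrogate-key:")}

    # Pass 2: rewrite every value (ids and references alike) that cites one.
    def rewrite(obj):
        if isinstance(obj, str):
            return mapping.get(obj, obj)
        if isinstance(obj, list):
            return [rewrite(v) for v in obj]
        if isinstance(obj, dict):
            return {k: rewrite(v) for k, v in obj.items()}
        return obj

    return [rewrite(e) for e in entities]
```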
### 6. Storage Service ###
The Manifest Ingestion Workflow will invoke the `RecordAPI.createOrUpdateRecords` API endpoint within the Storage service for data that passes validation.
_Requirements_
- The process may opt to present all records to the Storage Service at once, one at a time, or in smaller batches. A batching sketch follows this list.
- Note that the Storage Service does not support transactions, so rollbacks are not possible. This is the reason for the upfront validation checks to help reduce the manual work required in backing out partially stored manifests.
- A failure of a single record will not constitute the failure of all contents in the Manifest.
- A failure of a parent record will prevent the Ingestion Service from processing the child records.
- Responses from the Storage API should be logged where appropriate.
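A batching sketch under these constraints, with `put_records` standing in for `RecordAPI.createOrUpdateRecords` and records assumed to be pre-sorted parent-first:

```python
# Sketch of batched, parent-first writes; put_records is a placeholder for
# RecordAPI.createOrUpdateRecords. No rollback exists, so we stop on failure.
import logging

logger = logging.getLogger(__name__)


def put_records(records: list) -> list:
    """Placeholder: createOrUpdateRecords; returns the ids that were stored."""
    raise NotImplementedError


def write_in_batches(ordered_records: list, batch_size: int = 500) -> list:
    stored = []
    for start in range(0, len(ordered_records), batch_size):
        batch = ordered_records[start:start + batch_size]
        try:
            stored.extend(put_records(batch))  # log responses as appropriate
        except Exception:
            # Dependent (child) records beyond this point are left unwritten;
            # the log records how far the write progressed.
            logger.exception("Write failed at record %d; %d records stored",
                             start, len(stored))
            raise
    return stored
```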
## Potential Roadmap for Manifest Ingestion ##
| Horizon 1 | Horizon 2 | Horizon 3 |
| --------- | --------- | --------- |
| Able to submit a properly constructed manifest, validate syntax, validate the content, and successfully write the data via the storage service. Supports steps 1 - 6 identified above. Target is Milestone 4 | Able to handle moving Dataset.Files from a landing zone to permanent storage. Will not support Dataset.FileCollections. Target is Post-R3 | Support Dataset.FileCollections. Target is Post-R3 |