This page captures the Scope, Definition of Done, Horizons, and Milestones for the R3 Manifest Ingestion.
The approach for R3 centers on the following concepts:
* Pre-ingestion work helps ensure well-formed data enters OSDU (meaning, work is performed outside of OSDU to create the Manifests - see below for additional details on the Manifest itself).
* The latest Data Definition Schemas ([v1.0.0](https://community.opengroup.org/osdu/data/data-definitions/-/tree/1bdc6e43858d7f0202316135ee4b9a943a26e297)) provide robust data modeling and relationship modeling capabilities that enable programmatic enforcement without requiring domain understanding. The Manifest Ingestion process does not have domain context. Note that the schemas reflected above represent the Data Definition team's work on defining the Well Known Structure (WKS) format for OSDU data types to promote and encourage interoperability. You can view the latest Data Definitions schemas on the [Data Definitions GitLab site](https://community.opengroup.org/osdu/data/data-definitions).
* Loading by Manifest using the schemas defined by the Data Definitions team ensures the metadata describing the underlying source data adheres to the [Well Known Structure](https://community.opengroup.org/osdu/documentation/-/wikis/OSDU-(C)/Design-and-Implementation//Entity-and-Schemas/Demystifying-Well-Known-Schemas,-Well-Known-Entities,-Enrichment-Pipelines) concept, which supports interoperability, a core [promise](https://osduforum.org/about-us/who-we-are/osdu-mission-vision/) of OSDU. While the Manifest Ingestion process focuses on loading metadata described in a Manifest, OSDU R3 allows for the registration of new schemas, and the Ingestion Framework enables new ingestion workflows, empowering others to write custom workflows that load data in other formats in compliance with their own data management standards and processes.
* The intent of the Manifest Ingestion is to create a mechanism to load source data in its original format while enabling discovery (index, search, deliver). The Ingestion Framework enables more complex workflows capable of building more robust datasets using the source data through workflows focused on enrichment, parsing, etc. Approaching ingestion in this manner preserves the source data while also creating and presenting consumption-ready data products.
...
The scope for R3 Manifest Ingestion is documented via the Ingestion Use Cases found...
The picture above depicts the conceptual architecture for the R3 Manifest Ingestion scope. Much of the complexity has been extracted for the sake of simplicity, but the picture hopefully illustrates the intent. We will define scope through the Definition of Done. In short, the following is considered _In-Scope_ for R3. The numbers presented in the diagram will receive additional context below. Furthermore, the Manifest Ingestion workflow runs within the Ingestion Framework. Therefore, the architecture above is meant to illustrate the components the Manifest Ingestion workflow depends on and not the architecture designed for the Manifest Ingestion workflow. In other words, the Manifest Ingestion workflow is a tenant of the Ingestion Framework.
- Validate (Syntax and Content) and Process the contents of a [Manifest](https://community.opengroup.org/osdu/data/data-definitions/-/blob/1bdc6e43858d7f0202316135ee4b9a943a26e297/Generated/manifest/Manifest.1.0.0.json) into OSDU via the Storage Service. Validation will occur within the workflow to promote scalability given the workflows are executed asynchronously and a large manifest file could take some time to validate. Some of the validation might be optional in that a process can elect to skip or ignore the validation step through recomposition of the workflow data operators. The intent is to preserve the integrity of the platform without prescribing data management practices.
- At the completion of the Manifest Ingestion workflow, a notification must be generated indicating that the workflow is complete, allowing other workflows to initiate. The trigger capability that acts on this notification is expected to come post-R3.
### Definition of Done ###
In its simplest form, the Manifest Ingestion Workflow works as follows:
- A well-formed Manifest is created externally to OSDU and presented to the Ingestion Service (`os-ingestion` module). The Ingestion Service serves as a common entry point that offers structure to forming an ingestion request.
- Some lightweight validation may occur at this step, such as ensuring the request is well-formed. If any errors occur, an exception will be thrown. If the request is properly structured, then the Ingestion Service will look to the Ingestion Workflow (`os-workflow`) to initiate the appropriate workflow for the manifest ingestion process.
- The workflow first ensures the manifest is syntactically correct, then ensures the content (or intent) is correct, and then persists the manifest. Any validation errors should result in the termination of the workflow.
- The Manifest Ingestion Workflow must be Cloud Service Provider (CSP) agnostic
There are four primary phases the Manifest Ingestion Process undergoes:
- Payload Resolution - The payload was either submitted to the Ingestion Service or a pointer to the payload was provided.
- Syntax Validation - This stage ensures the Manifest is structurally (syntactically) correct based on the referenced schemas (identified by the `kind` property). Schema Validation occurs as follows:
1. The submitted Manifest file is validated against the registered Manifest Schema within OSDU based on the submitted Manifest's `kind`
2. The Manifest has embedded data that have their own Schemas, which are also identified by their `kind`. The Schema Validation process will fetch and validate each component of the Manifest that has a specified `kind`. If any of the validations fail, errors are logged and the Manifest Ingestion workflow is terminated unless the invoking process opted to ignore errors.
- Pre-Pass - Here, the Manifest content is checked for intent. The OSDU schema definitions carry additional information that permits further validation without requiring knowledge of the data's domain. For example, where cited data carries an `id`, the data should already exist in the destination OSDU instance, and the validation will ensure that it does. Conversely, if presented Reference Data or Master Data carries a valid `id` and a record with that `id` already exists, the validation will fail the Manifest to avoid duplication. The `x-osdu-relationship` annotation of OSDU schemas also informs relationships, which can be validated programmatically. If the submitting process opts to ignore errors, then any validation errors encountered will be logged, but the Manifest Ingestion workflow will not be terminated.
- Process - At this stage, the data has been validated, and either no errors were discovered, or the submitting process opted to ignore the errors. Version 1.0.0 of the OSDU schemas leverage `surrogate-key`s to represent relationships between the data elements prior to persisting the data as the `id`s are unknown when the Manifest is created. The Process stage will replace the `surrogate-key` instances with generated `id`s to ensure referential integrity is maintained. Note that this stage will create the `id`s within the workflow vs. letting the Storage service create the `id`s. This is done to reduce the complexity in determining the correct graph order to write all the presented data. The `id`s will be created as per the pattern defined in the schemas so as to be consistent with the platform. If the desire is to have the system create the `id`s, then leverage multiple Manifests and an external process to govern the order of the writes, or have the external process invoke the Storage API directly.
The intent of the validation checks is to minimize the work required to manually address data loading errors should they occur. Release 3 of OSDU does not have a rollback mechanism.
As data is successfully persisted by the Storage Service, notifications are generated. A process within the Ingestion Framework may subscribe to the Notifications to initiate Ingestion Workflows that take additional action on the data. Additionally, a notification is also generated on the completion of a workflow. A Workflow Trigger will subscribe to these notifications enabling it to initiate additional workflows as configured. The trigger capability is likely to come post-R3.
### The Manifest Ingestion Process ###
The following diagram illustrates the workflow and the sequence of steps, which are further described below.
NOTE: Where "Manifest Ingestion Workflow" is referenced, we are referring to the Directed Acyclic Graph (DAG) responsible for processing the Manifest data. This DAG contains one or more operators that process the data in some manner. The collection of operators makes a DAG and the DAG is what the Ingestion Workflow Service will execute.
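To make the DAG concept concrete, the sketch below shows one way such a workflow could be composed in Airflow. The DAG id, task ids, and stubbed operator bodies are illustrative assumptions, not the actual OSDU implementation.

```python
# Illustrative only: names are assumptions, and the operator bodies are stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def resolve_payload(**context):
    """Obtain the manifest, inline or by dereferencing a dataset pointer."""


def validate_syntax(**context):
    """Check the manifest and its contents against schemas fetched by kind."""


def pre_pass(**context):
    """Intent checks: cited data exists, duplication, surrogate-keys."""


def process(**context):
    """Resolve surrogate-keys and persist records via the Storage Service."""


with DAG(
    dag_id="manifest_ingestion",      # the unique name a workflow registry would hold
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,           # triggered on demand by the Workflow Service
) as dag:
    steps = [
        PythonOperator(task_id="payload_resolution", python_callable=resolve_payload),
        PythonOperator(task_id="syntax_validation", python_callable=validate_syntax),
        PythonOperator(task_id="pre_pass_validation", python_callable=pre_pass),
        PythonOperator(task_id="process", python_callable=process),
    ]
    # Chain the operators; an optional validation operator can be removed from
    # this list to recompose the DAG (see the Design Consideration in the
    # Pre-Pass step below).
    for upstream, downstream in zip(steps, steps[1:]):
        upstream >> downstream
```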
### Open Questions ###
1. How does a request to re-process a manifest change the behavior of the ingestion? Theoretically, the validation might differ in that objects with an `id` value specified must already exist within the platform. In an initial processing request, a specified `id` value assumes that an external process has determined the `id` of the entity vs. letting the Storage service make that determination.
2. How should the `IsExtendedLoad` and `IsDiscoverable` flags within the `AbstractAnyRecordWorkProduct` and `AbstractAnyRecordWorkProductComponent` schema definitions affect the ingestion process?
### 1. Initiating Ingestion ###
The Manifest Ingestion process is initiated via the Ingestion Service (the module name is `os-ingestion`). However, given the Workflow Service is leveraged, a process could invoke the manifest ingestion workflow directly through the Ingestion Workflow Service.
_Requirements_
- The Ingestion Service should ensure the calling process is authenticated and authorized to perform ingestion.
- The caller can provide a workflow payload that either contains the manifest payload or a standardized structure containing a pointer to the payload that was pre-loaded to storage (think Datasets or Files and passing references to data vs. passing the data itself).
- The caller can also provide the name of a workflow if a workflow other than the default workflow should be initiated for manifest ingestion. This allows greater flexibility in ingesting data.
- The Ingestion Service should validate the request is properly structured and correct.
- Any errors should produce an exception that is thrown or logged and the process should terminate. For R3 we will not support robust batch processing allowing for partial failures. We do not have a good rollback mechanism so we must protect the integrity of the platform by detecting errors early and reducing manual work to resolve partial writes.
- If security and initial validation checks pass, the Ingestion Service should invoke the correct workflow via the Ingestion Workflow Service.
- A successfully initiated workflow via the Ingestion Workflow Service will produce a `workflowId`, which should be returned to the Ingestion Service caller to enable workflow status queries.
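To illustrate the requirements above, a caller might initiate ingestion along the following lines. The endpoint path, header names, and payload field names are assumptions made for this sketch; consult the Ingestion Service API definition for the actual contract.

```python
import json

import requests

# Assumed deployment URL, endpoint path, and field names -- illustration only.
INGESTION_URL = "https://osdu.example.com/api/ingestion/v1/submitWithManifest"

with open("manifest.json") as f:
    manifest = json.load(f)

response = requests.post(
    INGESTION_URL,
    headers={
        "Authorization": "Bearer <access-token>",  # caller must be authenticated/authorized
        "data-partition-id": "opendes",            # assumed partition header
    },
    json={
        "WorkflowName": "manifest_ingestion",  # optional: name a non-default workflow
        "Manifest": manifest,                  # alternatively, a pointer to a pre-loaded payload
    },
)
response.raise_for_status()                    # any error terminates the request
workflow_id = response.json()["WorkflowID"]    # returned to enable workflow status queries
```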
### 2. Initiating the Workflow ###
The Workflow Service contained within the Ingestion Framework is capable of initiating a workflow. OSDU R3 leverages Apache Airflow for workflow execution. On initiating a workflow, a `workflowId` is created, which may be used to fetch the status of the workflow from the Ingestion Workflow Service (i.e., `WorkflowStatusAPI.getWorkflowStatus`).
_Requirements_
- Workflows must be discoverable via a registry using a unique name to allow maximum flexibility and reusability.
- The Ingestion Framework should validate the named workflow exists within the workflow registry and throw an error if the named workflow is not found.
- If the named workflow does exist within the registry, the Ingestion Workflow Service must initiate the workflow with the presented payload.
- On initiating a workflow, a unique `workflowId` must be created and returned to the calling process.
- The payload provided to the Ingestion Workflow Service must be presented to the initiated workflow for processing.
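A toy sketch of the registry lookup and `workflowId` behavior described above follows. The registry shape and the trigger call are stand-ins; the real Workflow Service persists registrations and triggers Airflow.

```python
import uuid

# Toy in-memory registry; the real Ingestion Framework persists registered workflows.
WORKFLOW_REGISTRY = {"manifest_ingestion"}


def trigger_airflow_dag(dag_id: str, run_id: str, conf: dict) -> None:
    """Stand-in for the call that asks Airflow to run the named DAG."""
    print(f"Triggering DAG '{dag_id}' (run {run_id}) with payload keys {sorted(conf)}")


def start_workflow(workflow_name: str, payload: dict) -> str:
    if workflow_name not in WORKFLOW_REGISTRY:
        raise ValueError(f"Workflow '{workflow_name}' is not registered")  # error if not found
    workflow_id = str(uuid.uuid4())                           # unique id created per initiation
    trigger_airflow_dag(workflow_name, workflow_id, payload)  # payload passed through unchanged
    return workflow_id                                        # returned to the calling process
```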
The Manifest Ingestion workflow is a default ingestion workflow capable of processing version 1.0.0 of the Manifest Schema, which may contain Master Data, Reference Data, Work Product, Work Product Components, or Files (note here that "File" is a superset of data sources that is capable of representing Datasets - see the [Dataset as Core Service ADR](https://community.opengroup.org/osdu/platform/system/home/-/issues/65#register-pane) for more information).
The intent of the Manifest Ingestion Workflow is to provide out-of-the-box capability within OSDU to store and define source data in its original format while making that data searchable and discoverable via the metadata provided in the manifest.
Steps 3 - 6 occur within a DAG and each step represents a separate DAG Operator.
### 3. Payload Resolution ###
This is a placeholder for a potential operator that could resolve the manifest using a pointer to a dataset that is the manifest itself. In this scenario, a process created the manifest, stored it via the dataset service (in the delivery service), and obtained a dataset id. The dataset id was then presented to the Ingestion Service for processing as a reference to the manifest to be ingested vs. passing the full manifest itself. This would allow data to be passed by reference rather than by value.
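A sketch of what such an operator could do appears below, assuming hypothetical payload field names (`Manifest`, `DatasetId`) and a hypothetical dataset retrieval endpoint.

```python
import requests

# Field names and the dataset retrieval endpoint are assumptions for this sketch.
DATASET_URL = "https://osdu.example.com/api/dataset/v1/retrieve"


def resolve_manifest(workflow_payload: dict, token: str) -> dict:
    """Return the manifest, whether passed by value or by reference."""
    if "Manifest" in workflow_payload:
        return workflow_payload["Manifest"]     # passed by value
    dataset_id = workflow_payload["DatasetId"]  # passed by reference
    response = requests.get(
        f"{DATASET_URL}/{dataset_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    response.raise_for_status()
    return response.json()
```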
### 4. Syntax Checking ###
The first validation to occur is Syntax Checking. This validation leverages the schema definitions to ensure the manifest content adheres to the schema definitions of both the manifest and the data it contains. There are two primary syntax checks:
1. The manifest itself is validated against the manifest schema definition.
2. The contents of the manifest reference schemas using the `kind` property. Because the manifest references Master Data, Reference Data, Work Product, Work Product Components, and data containers (identified by the `Files` property), there will be additional schemas referenced that must also be validated against their schema definitions.
_Requirements_
- Validate the manifest against its schema definition by fetching the schema from the Schema Service using the manifest's `kind` property.
- Validate the manifest payload per the validation rules listed below.
- Any validation errors must be logged and will result in the termination of the workflow.
- Schema validation occurs by fetching from the Schema Service the schema for each `kind` presented in the manifest payload. This requires a full traversal of the manifest content to find each object with a specified `kind` property.
| Validation Rule | Description | Required? |
| --------------- | ----------- | --------- |
| Manifest Syntax check | Fetch the schema definition from the Schema Service for the `kind` property of the manifest. Validate that the entire manifest is correct according to its schema validation. If the `kind` does not map to a Schema Definition, throw/log an error and terminate the workflow. | Yes |
| Content Syntax Validation | Traverse the manifest for `kind` properties. For each `kind` property found, retrieve the schema definition from the Schema Service. Validate the entire object containing the `kind` property against the schema returned by the Schema Service for that `kind`. If the Schema Service is unable to find a Schema Definition for a given `kind` throw/log an error and terminate the workflow. | Yes |
| Unknown attributes | The validation should ensure each object carries no attributes beyond those defined in its schema. Unknown attributes indicate a malformed payload and should be logged as errors. | Yes |
| Valid Hierarchy | You might expect this validation rule to exist in the pre-pass, but the schema definitions allow us to get this rule for free as part of the property format validation. Take the [WellLog.1.0.0.json](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Generated/work-product-component/WellLog.1.0.0.json) schema definition as an example. A Well Log has an optional `resourceHomeRegionID` property. If the property is specified, then it must follow the format `^[\\w\\-\\.]+:reference-data\\/OSDURegion:.+:[0-9]*$`, which means any value referencing a type other than an OSDURegion will fail. We can therefore perform the validation check without requiring knowledge of the data's domain. | N |
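As an illustration of the two required checks, the sketch below fetches schemas by `kind` (from an assumed Schema Service URL) and validates with the Python `jsonschema` package; the actual workflow operators may differ.

```python
import jsonschema
import requests

# Assumed Schema Service location; kinds map to registered schema definitions.
SCHEMA_URL = "https://osdu.example.com/api/schema-service/v1/schema"


def fetch_schema(kind: str) -> dict:
    response = requests.get(f"{SCHEMA_URL}/{kind}")
    response.raise_for_status()  # unknown kind: throw/log and terminate the workflow
    return response.json()


def walk(node):
    """Depth-first traversal over every nested dict and list in the manifest."""
    yield node
    if isinstance(node, dict):
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        children = []
    for child in children:
        yield from walk(child)


def validate_manifest(manifest: dict) -> None:
    # Rule 1: the manifest itself against the schema named by its own kind.
    jsonschema.validate(instance=manifest, schema=fetch_schema(manifest["kind"]))
    # Rule 2: every embedded object carrying a kind, found by full traversal.
    for node in walk(manifest):
        if isinstance(node, dict) and "kind" in node and node is not manifest:
            jsonschema.validate(instance=node, schema=fetch_schema(node["kind"]))
```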
### 5. Pre-Pass ###
The Pre-Pass stage validates the intent of the manifest. Syntax validation looks at whether the objects are correctly formed, mandatory attributes are present, no unknown attributes are included, and whether properties adhere to the specified type and format. Intent validation takes it one step further and looks at the content as it was presented to determine if it makes sense. Because the manifest ingestion does not have knowledge of data domains, it must continue to rely on validation driven by the contents of the schema definition. The OSDU R3 schema definitions have extension properties that help provide the context for this validation.
Given the schema definitions are used in this validation step and the previous step, effort should be made to avoid fetching the schemas a second time. Note that like step 4, step 5 looks at all schemas referenced in the manifest content to perform this validation. A simple caching approach is sketched below.
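Assuming the `fetch_schema` helper from the previous step's sketch, memoizing the Schema Service lookup for the duration of the run could look like this:

```python
from functools import lru_cache

import requests


@lru_cache(maxsize=None)
def fetch_schema_cached(kind: str) -> dict:
    """Each kind is fetched from the Schema Service once; later steps hit the cache."""
    response = requests.get(f"{SCHEMA_URL}/{kind}")  # SCHEMA_URL as in the previous sketch
    response.raise_for_status()
    return response.json()
```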
_Requirements_
- Validate the manifest payload per the validation rules listed below.
- Any validation errors must be logged and will result in the termination of the workflow.
- Schema validation occurs by fetching from the Schema Service the schema for each `kind` presented in the manifest payload. This requires a full traversal of the manifest content to find each object with a specified `kind` property.
_Design Consideration_
- Put each discrete validation check into its own DAG operator if the validation rule is considered optional. This will allow platform owners a mechanism of recomposing a DAG to exclude those validation rules they wish to skip.
Some of the rules below rely on the use of the OSDU schema definition extension property `x-osdu-relationship` to perform the validation. Here is an example of how this process might work:
1. Traverse the contents of the manifest's `Data.WorkProductComponents` section, which is likely to contain more than one element.
2. When a `kind` property is found, extract the value of the property.
3. Use the `kind` value to fetch the schema definition from the Schema Service (or via cache).
4. Traverse the schema definition and seek those property definitions that contain an `x-osdu-relationship` definition.
5. Capture the property name. For example, the [WellLog.1.0.0.json](https://community.opengroup.org/osdu/data/data-definitions/-/blob/master/Generated/work-product-component/WellLog.1.0.0.json) schema has a property defined called `resourceHostRegionIDs` that has within its `items` object an `x-osdu-relationship` definition. We now know that the `resourceHostRegionIDs` property references another OSDU entity. This qualifies this property for additional validation checks. See below for specific validation steps to perform for this situation.
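A sketch of steps 4 and 5 above: walk a fetched schema definition and collect the properties that declare an `x-osdu-relationship`. Note this sketch does not resolve `$ref` composition, which the real OSDU schemas rely on.

```python
def relationship_properties(schema: dict, path: str = "") -> list:
    """Collect (property path, x-osdu-relationship value) pairs from a schema."""
    found = []
    for name, definition in schema.get("properties", {}).items():
        if not isinstance(definition, dict):
            continue
        prop_path = f"{path}.{name}" if path else name
        # The annotation may sit on the property itself or on its array items.
        for candidate in (definition, definition.get("items", {})):
            if isinstance(candidate, dict) and "x-osdu-relationship" in candidate:
                found.append((prop_path, candidate["x-osdu-relationship"]))
        # Recurse into nested object definitions.
        found.extend(relationship_properties(definition, prop_path))
    return found
```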
| Validation Rule | Description | Required? |
| --------------- | ----------- | --------- |
| Surrogate Keys | Ensure the use of `surrogate-keys` is consistent and accurate. Ensure all `surrogate-key` references to a parent entity are resolved within the manifest (i.e., no orphaned `surrogate-keys`). The validation process requires identifying within a schema definition the use of the `x-osdu-relationship` extension property and then checking the manifest's value for that property to see if it has the `surrogate-key` pattern (e.g., `^(surrogate-key:.+|[\\w\\-\\.]+:`). If it does, then an entity must exist within the manifest payload that has an `id` property with a matching `surrogate-key` value. If not, then an invalid reference exists. Throw/log an error and terminate the workflow. | N |
| Duplication | This requires more research. I believe the intent is to validate that MasterData and ReferenceData provided with a pre-set `id` property do not already have an entry in OSDU with the same `id`. If it does, then the process is trying to load duplicate data and it should be rejected. Error is thrown/logged and workflow terminated. | Y |
| Cited Data Exists | If a property is found within a `kind`'s schema definition to contain an `x-osdu-relationship` definition, and the value of the property within the manifest payload does not have a `surrogate-key` pattern, then fetch the value and leverage the Storage API to determine if the referenced data exists. If it does exist, the validation passes. If not, the validation fails, an error is thrown/logged, and the workflow is terminated. This rule is applicable to references to Reference Data and Master Data. | Y |
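As an illustration of the Surrogate Keys rule, the sketch below assumes the manifest's entities have been flattened into a list and the candidate reference values have already been extracted via the `x-osdu-relationship` traversal above.

```python
def check_surrogate_keys(entities: list, reference_values: list) -> list:
    """Return an error per surrogate-key reference that no manifest entity declares."""
    declared = {
        str(e.get("id", ""))
        for e in entities
        if str(e.get("id", "")).startswith("surrogate-key:")
    }
    errors = []
    for value in reference_values:
        # Reference values may carry a trailing version colon; normalize before matching.
        if value.startswith("surrogate-key:") and value.rstrip(":") not in declared:
            errors.append(f"Orphaned surrogate-key reference: {value}")
    return errors  # any entry here should be logged and terminate the workflow
```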
### 6. Process ###
By the time we reach the Process stage, we've done our best to ensure that the data to be written will succeed. The Process stage will iterate through the manifest and write the data in the correct order while also handling the `surrogate-key` resolution.
_Requirements_
- Write Reference Data, then Master Data, then File Data, then Work Product Data, then Work Product Component Data.
- Resolve the `surrogate-key` values using workflow-generated `id`s that conform to the pattern defined in the schema definitions (rather than letting the Storage Service assign the `id`s - see the Process phase description above).
- Log/throw errors and terminate the workflow.
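A sketch of the id minting and reference rewriting follows; the list shapes and the generated id pattern are illustrative, as the authoritative pattern comes from each entity's schema definition.

```python
import uuid

# Write order per the requirement above.
WRITE_ORDER = ["ReferenceData", "MasterData", "Files", "WorkProduct", "WorkProductComponents"]


def assign_ids(entities: list, partition: str = "opendes") -> dict:
    """Mint real ids for surrogate-keyed entities; return the old-to-new mapping."""
    key_map = {}
    for entity in entities:
        if str(entity.get("id", "")).startswith("surrogate-key:"):
            entity_type = entity["kind"].split(":")[2]             # e.g. work-product-component--WellLog
            real_id = f"{partition}:{entity_type}:{uuid.uuid4()}"  # illustrative id pattern
            key_map[entity["id"]] = real_id
            entity["id"] = real_id
    return key_map


def rewrite_references(node, key_map: dict):
    """Recursively swap surrogate-key strings for their generated ids."""
    if isinstance(node, dict):
        return {k: rewrite_references(v, key_map) for k, v in node.items()}
    if isinstance(node, list):
        return [rewrite_references(v, key_map) for v in node]
    if isinstance(node, str):
        return key_map.get(node.rstrip(":"), node)  # tolerate a trailing version colon
    return node
```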
### 7. Storage Service ###
The Manifest Ingestion Workflow will invoke the `RecordAPI.createOrUpdateRecords` API endpoint within the Storage service for data that passes validation.
_Requirements_
- The process may opt to present all records to the Storage Service at once, or it may present them one at a time or in smaller batches.
- Note that the Storage Service does not support transactions, so rollbacks are not possible. This is the reason for the upfront validation checks to help reduce the manual work required in backing out partially stored manifests.
- A failure of a single record will not constitute the failure of all contents in the Manifest.
- A failure of a parent record will prevent the workflow from processing the child records.
- Responses from the Storage API should be logged where appropriate.
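For illustration, batched writes against the Storage Service might look like the sketch below; the deployment URL is assumed, and the sketch assumes `createOrUpdateRecords` is exposed as a `PUT` of a record array.

```python
import requests

STORAGE_URL = "https://osdu.example.com/api/storage/v2/records"  # assumed deployment URL


def put_records(records: list, headers: dict, batch_size: int = 500) -> None:
    """Present records in smaller batches; there are no transactions or rollbacks."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        response = requests.put(STORAGE_URL, json=batch, headers=headers)
        if response.ok:
            print(f"Stored batch of {len(batch)}: {response.json()}")  # log where appropriate
        else:
            # A failed batch is logged; records already written stay written,
            # and children of a failed parent must not be presented.
            print(f"Batch at offset {start} failed: {response.status_code} {response.text}")
            break
```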
### Out-of-Scope ###
- The design and implementation of the Manifest Ingestion process may require updates to other OSDU services and components within the platform. Where those changes are required, we will submit the required ADRs and work through the required processes to have those items approved and implemented by the identified teams who own the delivery of those services.