ADR: EDS DMS - Reverse Naturalization
Introduction:
This ADR addresses the use case of reversing dataset naturalization in order to clean up the operator's cloud storage.
Objective:
This ADR outlines the approach and decision for reversing the dataset naturalization process (external to internal) for the WPC dataset IDs, in order to free storage space on the operator's cloud storage. Reverting the dataset IDs to their original external state is intended to optimize storage utilization, keep data management efficient, and maintain consistency with the external dataset IDs.
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Scope
The scope of this ADR includes the following scenarios:
- Fetch and verify the dataset ID:
- Retrieve from the WPC inputs the internal dataset IDs that need to be converted back to external ones. Also validate that the requested operation is applicable by verifying that the dataset type is internal and DatasetExternal is True.
- Fetch the external dataset ID:
- Fetch the externaldatasetid (the original external dataset ID) from the extension properties of the internal dataset schema.
- Reversion process and cloud storage update:
- Use the mapping obtained in the previous step to revert the internal dataset IDs to their original external dataset IDs for the ConnectedSource.Generic dataset.
- Update the associated WellLog with the reverted external dataset IDs.
- Given/assumptions:
- WellLog metadata is already present at the operator end.
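The eligibility check and the lookup of the original external ID described above could be sketched as follows. This is a minimal illustration, not the DAG's actual code: the record layout (an `id` field plus `data.ExtensionProperties`) and the helper names `is_reversible` and `original_external_id` are assumptions made for the example, and the property names are taken from the examples in this ADR.

```python
from typing import Optional

# Assumed type segment of an internal (naturalized) dataset ID.
INTERNAL_TYPE = "dataset--File.Generic"


def is_reversible(record: dict) -> bool:
    """Return True when the record is an internal dataset flagged as naturalized."""
    # IDs are assumed to look like "<partition>:<type>:<local id>".
    parts = record.get("id", "").split(":")
    if len(parts) < 3 or parts[1] != INTERNAL_TYPE:
        return False
    # Assumption: the flag lives under data.ExtensionProperties.
    ext = record.get("data", {}).get("ExtensionProperties", {})
    return ext.get("DatasetExternal") is True


def original_external_id(record: dict) -> Optional[str]:
    """Return the stored pointer to the original external dataset ID, if any."""
    ext = record.get("data", {}).get("ExtensionProperties", {})
    return ext.get("external_dataset_id")
```

A record that is not of the internal type, or that lacks the DatasetExternal flag, is simply reported as not reversible, so the caller can decide whether to skip it or fail the run.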
Required Changes:
- eds_dataset: Reverse naturalization will be added to the existing eds_dataset DAG that already handles naturalization.
- The DAG will be executed on demand, without the involvement of a scheduler.
- An additional boolean parameter should be introduced in the internal dataset schema (File.Generic) to indicate whether the dataset ID is internal or naturalized, so that naturalized dataset IDs can be tracked. This can be achieved through the ExtensionProperties of the dataset schema:
DatasetExternal: True
- During naturalization, the dataset ID should stay the same apart from its type, which changes from external to internal. This makes the reverse mapping easier. Example: opendes:dataset--ConnectedSource.Generic:test123 will be converted to: opendes:dataset--File.Generic:test123
- The ExtensionProperties can also carry additional parameters, such as a pointer to the external dataset ID (ConnectedSource.Generic) and its original version, e.g.:
external_dataset_id: opendes:dataset--ConnectedSource.Generic:test123
external_dataset_id_version: 1614105463059152
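To illustrate the ID convention and the tracking properties above, the sketch below swaps only the type segment of a dataset ID and builds the extension properties that naturalization would record. The function names `swap_dataset_type` and `naturalized_extension_properties` are hypothetical, and the three-part ID layout is inferred from the examples in this ADR.

```python
EXTERNAL_TYPE = "dataset--ConnectedSource.Generic"
INTERNAL_TYPE = "dataset--File.Generic"


def swap_dataset_type(dataset_id: str, from_type: str, to_type: str) -> str:
    """Replace the type segment of an ID, keeping partition and local ID unchanged."""
    try:
        partition, entity_type, local_id = dataset_id.split(":", 2)
    except ValueError:
        raise ValueError(f"unexpected dataset ID format: {dataset_id!r}") from None
    if entity_type != from_type:
        raise ValueError(f"{dataset_id!r} is not of type {from_type!r}")
    return f"{partition}:{to_type}:{local_id}"


def naturalized_extension_properties(external_id: str, version: int) -> dict:
    """Build the tracking properties stored on the internal dataset (assumed names)."""
    return {
        "DatasetExternal": True,
        "external_dataset_id": external_id,
        "external_dataset_id_version": version,
    }
```

Because only the type segment changes, the same helper covers both directions: naturalization calls it with (EXTERNAL_TYPE, INTERNAL_TYPE) and reversal with the arguments swapped.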
Inputs:
The inputs for the reverse naturalization process should be an array of WPC IDs and the operation that needs to be performed. Each WPC ID represents a specific work-product-component. Here is an example of the input structure:
{"ids": ["osdu:work-product-component--WellLog:testawsWPC"],"operation":"reverse"}
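A minimal sketch of how the DAG could validate this input before doing any work is shown below. The function name `parse_request` and the set of supported operations are assumptions for illustration; this ADR only defines the "reverse" operation.

```python
import json
from typing import List, Tuple

# Assumption: "reverse" is the only operation handled by this sketch.
SUPPORTED_OPERATIONS = {"reverse"}


def parse_request(raw: str) -> Tuple[List[str], str]:
    """Parse the JSON input and return (WPC IDs, operation), or raise ValueError."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("input must be a JSON object")
    ids = payload.get("ids")
    operation = payload.get("operation")
    if not isinstance(ids, list) or not ids:
        raise ValueError("'ids' must be a non-empty array of WPC IDs")
    if not all(isinstance(item, str) and item for item in ids):
        raise ValueError("every entry in 'ids' must be a non-empty string")
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation!r}")
    return ids, operation
```

Rejecting malformed input up front keeps later steps from touching storage on behalf of a request that cannot be completed.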
Implementation:
- Initiate reverse naturalization on the list of dataset IDs when the given operation is "reverse".
- Verify that each dataset ID derived from the DAG inputs corresponds to an internal data type (File.Generic).
- Convert the internal dataset ID to the external dataset ID (ConnectedSource.Generic), which is already present at the operator's end, by fetching it from the extension properties of the internal dataset schema.
- Re-ingest the WellLog record (schema ID) along with the external dataset ID (ConnectedSource.Generic) so that the updated reference is persisted.
- Remove the file from blob storage.
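The steps above can be sketched end to end against in-memory stand-ins. Everything here is an assumption made for illustration: `records` is a plain dict standing in for the record store, `blobs` is a set standing in for blob storage, the WellLog is assumed to list its datasets under `data.Datasets`, and `reverse_naturalization` is a hypothetical name. A real DAG would call the platform services instead.

```python
from typing import Dict, Set

INTERNAL_TYPE = "dataset--File.Generic"


def reverse_naturalization(wpc_id: str, records: Dict[str, dict], blobs: Set[str]) -> Dict[str, str]:
    """Point a WellLog back at its external datasets and drop the internal files.

    Returns the mapping of internal dataset ID to restored external dataset ID.
    """
    wpc = records[wpc_id]
    mapping: Dict[str, str] = {}

    # Verify each dataset and look up its original external ID first,
    # so nothing is modified if any dataset turns out to be inconsistent.
    for dataset_id in wpc["data"]["Datasets"]:
        dataset = records.get(dataset_id, {})
        ext = dataset.get("data", {}).get("ExtensionProperties", {})
        is_internal = dataset_id.split(":")[1:2] == [INTERNAL_TYPE]
        if not (is_internal and ext.get("DatasetExternal") is True):
            continue  # not a naturalized dataset, leave it untouched
        external_id = ext.get("external_dataset_id")
        if not external_id:
            raise ValueError(f"{dataset_id} has no pointer to its external dataset")
        mapping[dataset_id] = external_id

    # "Re-ingest" the WellLog with the external dataset IDs.
    wpc["data"]["Datasets"] = [mapping.get(d, d) for d in wpc["data"]["Datasets"]]

    # Only after the WellLog is updated, remove the naturalized files.
    for internal_id in mapping:
        blobs.discard(internal_id)
    return mapping
```

Updating the WellLog before deleting any file follows the order of the steps listed above and avoids a window in which the WellLog references a file that no longer exists.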
Sequence Diagram
Functional Requirements:
- Dataset ID Mapping:
- Implement a mechanism to map the internal dataset IDs to their corresponding external dataset IDs.
- Type Conversion:
- A universal solution should be developed to facilitate seamless conversion of internal dataset IDs to their corresponding external IDs, regardless of the data type.
- Consistent ID Preservation:
- The same ID should be used throughout the entire process; only the dataset type is converted.
- Data Validation and Integrity:
- Implement validation checks to ensure the correctness and integrity of the dataset IDs during the naturalization process.
Non-functional Requirements
- Performance and Scalability:
- Ensure that the solution can handle a large volume of data and can perform conversions efficiently within acceptable time limits.
- Design the solution to be scalable, allowing it to handle increasing data loads and accommodate future growth.
- Reliability and Error Handling:
- Implement robust error handling mechanisms to gracefully handle exceptions and errors during the conversion process.
- Ensure the solution has built-in resilience to recover from failures and minimize disruptions to the overall system.
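One way to approach the error-handling requirement is to isolate failures per WPC ID, so that a single bad record does not abort the whole batch. The sketch below is only an illustration of that idea: `process_batch` is a hypothetical helper, and `handler` stands for whatever function performs the reversal of one WPC.

```python
from typing import Callable, Dict, List, Tuple


def process_batch(
    ids: List[str], handler: Callable[[str], None]
) -> Tuple[List[str], Dict[str, str]]:
    """Run handler for each WPC ID and collect failures instead of stopping."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    for wpc_id in ids:
        try:
            handler(wpc_id)
        except Exception as exc:  # record the error and continue with the next ID
            failed[wpc_id] = str(exc)
        else:
            succeeded.append(wpc_id)
    return succeeded, failed
```

The returned failure map can be logged or surfaced in the DAG run result, which lets an operator re-run the DAG with only the IDs that failed.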