ADR: EDS DMS - Bulk Data Ingestion and Dataset ID Naturalization
Introduction:
The purpose of this ADR is to address the advanced use case of using EDS DMS to add bulk data to the OSDU Platform. It also covers naturalizing the dataset IDs associated with the relevant schemas for Work-Product-Component (WPC) records that have linked data files. Importing the data and naturalizing the dataset IDs improves the handling of bulk data: the data files are properly added to the OSDU Platform, and the WPCs' child datasets are converted from "external" to "internal" type, which makes the data more accessible and better integrated within the platform.
Objective:
Currently, the EDS fetch-and-ingest process only copies the metadata, and the child dataset of the WPC at the operator's end is flagged as "external". Operators can use EDS DMS to obtain the actual data file when needed, but the file is not properly added to the Data Platform. This ADR proposes a solution to these limitations: add the data file to the OSDU Platform and convert the WPC's child dataset from "external" to "internal".
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Scope
The scope of this ADR includes the following scenarios:
- Importing Bulk Data: Importing bulk data from the provider end into the Data Platform using EDS DMS.
- File Storage in OSDU Instance: The use of the dataset API to store the fetched files back to the operator's OSDU instance.
- Naturalizing Dataset IDs and Re-ingestion: The naturalization of dataset IDs from external to internal to align them with the relevant schemas within the Data Platform.
Given/Assumptions:
- WPC metadata is already present at the operator end.
Required Changes:
- eds_dataset: A new DAG will be introduced to naturalize the bulk data after the data files have been uploaded at the operator's end.
- The DAG will be executed on demand, without the involvement of a scheduler.
- An additional boolean parameter should be introduced in the internal dataset schemas (File.Generic) to indicate whether the dataset ID is natively internal or naturalized, so that naturalized dataset IDs can be tracked. This can be achieved by using the ExtensionProperties of the dataset schemas, for example:
DatasetExternal:True
- During the naturalization process, the dataset ID should stay the same apart from the dataset type, which changes from external to internal. Keeping the rest of the ID unchanged makes reverse mapping easier. Example: opendes:dataset--ConnectedSource.Generic:test123 will be converted to opendes:dataset--File.Generic:test123
Additional parameters can also be stored within the ExtensionProperties, such as a pointer to the external dataset ID (ConnectedSource.Generic) together with its version, e.g. Parent_dataset_id:opendes:dataset--ConnectedSource.Generic:test123 Parent_dataset_id_version: 1614105463059152
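The ID conversion and the tracking properties described above can be sketched as follows. This is a minimal illustration, assuming dataset IDs follow the `partition:dataset--Type:identifier` pattern shown in the example. The function names are hypothetical, and the property keys are the ones proposed in this ADR, not an existing OSDU API.

```python
EXTERNAL_TYPE = "dataset--ConnectedSource.Generic"
INTERNAL_TYPE = "dataset--File.Generic"


def naturalize_dataset_id(external_id: str) -> str:
    """Swap the external dataset type for the internal one.

    The partition and the identifier are kept unchanged so that the
    internal ID can be mapped back to its external parent.
    """
    parts = external_id.split(":")
    if len(parts) != 3 or parts[1] != EXTERNAL_TYPE:
        raise ValueError(f"Not an external dataset ID: {external_id!r}")
    partition, _, identifier = parts
    return f"{partition}:{INTERNAL_TYPE}:{identifier}"


def build_extension_properties(external_id: str, version: int) -> dict:
    """Build the tracking properties proposed in this ADR (key names are illustrative)."""
    return {
        "DatasetExternal": True,
        "Parent_dataset_id": external_id,
        "Parent_dataset_id_version": version,
    }


internal_id = naturalize_dataset_id("opendes:dataset--ConnectedSource.Generic:test123")
props = build_extension_properties(
    "opendes:dataset--ConnectedSource.Generic:test123", 1614105463059152
)
```

Here `internal_id` evaluates to `opendes:dataset--File.Generic:test123`, matching the example above.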
Inputs:
The inputs for the naturalization process should be an array of WPC IDs. Each WPC ID represents a specific work-product-component. Here is an example of the input structure:
[ "osdu:work-product-component--WellLog:7fdf1681b7ed1a1d54046ca1c2438add13719fafd18295a5e35d7bbdb45a53e4", "osdu:work-product-component--WellLog:7fdf1681b7ed1a1d54046ca1c2438ad89cf67"]
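Before the DAG starts any work, the input array could be checked along these lines. This is only a sketch: the regular expression encodes an assumed ID shape (`partition:work-product-component--Type:identifier`) inferred from the example above, and `validate_wpc_ids` is a hypothetical helper name.

```python
import re

# Assumed shape of a WPC record ID, inferred from the example input:
# <partition>:work-product-component--<Type>:<identifier>
WPC_ID_PATTERN = re.compile(r"^[\w\-.]+:work-product-component--[\w\-.]+:[\w\-.]+$")


def validate_wpc_ids(wpc_ids):
    """Return the de-duplicated WPC IDs, raising on empty or malformed input."""
    if not isinstance(wpc_ids, list) or not wpc_ids:
        raise ValueError("Input must be a non-empty array of WPC IDs")
    invalid = [
        item for item in wpc_ids
        if not isinstance(item, str) or not WPC_ID_PATTERN.match(item)
    ]
    if invalid:
        raise ValueError(f"Malformed WPC IDs: {invalid}")
    # dict.fromkeys drops duplicates while preserving the input order
    return list(dict.fromkeys(wpc_ids))
```

Rejecting malformed input up front avoids triggering EDS DMS calls for IDs that cannot refer to a WPC record.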
Implementation:
- Retrieve the ConnectedSource.Generic dataset for each of the specified input IDs (for example, a WellLog ID or a WellboreTrajectory ID).
- Initiate the EDS DMS Operator, passing the ConnectedSource.Generic as input, and retrieve the signed URL.
- Download the LAS file at the operator's end using the obtained signed URL.
- Utilize the Dataset File service to upload the downloaded LAS file to the OSDU instance for the operator.
- Convert the child dataset of the WPC from "external" to "internal" by reassigning it to the cloud location of the uploaded file, that is, by changing ConnectedSource.Generic to File.Generic.
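The steps above can be sketched as a small orchestration function. Every hook in `NaturalizationSteps` is a hypothetical placeholder for a real service call (Storage, EDS DMS, Dataset service). None of these names comes from an existing OSDU client library, and the actual DAG tasks may be split differently.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class NaturalizationSteps:
    """Hypothetical hooks; each one would wrap a real service call in the DAG."""
    get_external_dataset_ids: Callable[[str], List[str]]  # WPC ID -> ConnectedSource.Generic IDs
    get_signed_url: Callable[[str], str]                  # external dataset ID -> signed URL (EDS DMS)
    download: Callable[[str], bytes]                      # signed URL -> file content
    upload: Callable[[str, bytes], str]                   # (external ID, content) -> File.Generic ID
    relink: Callable[[str, Dict[str, str]], None]         # (WPC ID, external -> internal map)


def naturalize_wpc(wpc_id: str, steps: NaturalizationSteps) -> Dict[str, str]:
    """Run the five implementation steps for one WPC and return the ID mapping."""
    mapping: Dict[str, str] = {}
    for external_id in steps.get_external_dataset_ids(wpc_id):
        signed_url = steps.get_signed_url(external_id)
        content = steps.download(signed_url)
        mapping[external_id] = steps.upload(external_id, content)
    # The WPC is relinked once, after all of its child datasets are uploaded.
    steps.relink(wpc_id, mapping)
    return mapping
```

Passing the service calls in as callables keeps the flow testable with stubs, without access to a provider instance.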
Exception Handling:
- Dataset ID not present for the given WPC IDs: If the dataset ID is missing, raise an exception or return an error indicating the issue.
- Invalid or unrecognized WPC IDs or dataset IDs: Handle cases where the provided WPC ID or dataset ID is invalid or unrecognized at the operator's end.
- Data Retrieval Errors: Implement error handling for situations where the dataset associated with the ID cannot be retrieved.
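One way to model these cases is a small exception hierarchy, sketched below. The class and function names are hypothetical. The sketch also assumes that a WPC record lists its child datasets under `data.Datasets`, which should be confirmed against the WPC schemas in use.

```python
class NaturalizationError(Exception):
    """Base class for errors raised during dataset naturalization."""


class DatasetIdMissingError(NaturalizationError):
    """The WPC record references no external dataset."""


class InvalidIdError(NaturalizationError):
    """A WPC ID or dataset ID is invalid or unknown at the operator's end."""


class DataRetrievalError(NaturalizationError):
    """The dataset behind an ID could not be retrieved."""


def external_dataset_ids(wpc_record: dict) -> list:
    """Return the ConnectedSource.Generic IDs referenced by a WPC record."""
    # Assumption: child datasets are listed under data.Datasets.
    datasets = wpc_record.get("data", {}).get("Datasets") or []
    external = [d for d in datasets if ":dataset--ConnectedSource.Generic:" in d]
    if not external:
        raise DatasetIdMissingError(
            f"No external dataset ID on WPC {wpc_record.get('id')!r}"
        )
    return external


def collect_results(wpc_records: list) -> tuple:
    """Process every record, reporting failures per WPC instead of aborting the batch."""
    ok, failed = {}, {}
    for record in wpc_records:
        try:
            ok[record["id"]] = external_dataset_ids(record)
        except NaturalizationError as exc:
            failed[record["id"]] = str(exc)
    return ok, failed
```

Collecting failures per WPC ID is a design choice for batch inputs. A stricter variant could let the first exception fail the whole DAG run.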
Sequence Diagram
Functional Requirements:
- Data Import:
- The system should provide functionality to import bulk data from the provider's end using EDS DMS.
- EDS DMS Integration:
- The system should integrate with EDS DMS to establish a connection with the provider's data source.
- Dataset ID Naturalization:
- The system should support the naturalization of dataset IDs (external to internal) associated with different schemas by uploading the imported data to the operator's OSDU instance and creating the internal ID.
- Mapping and Re-ingestion:
- The system should map the transformed dataset ID back to the schema ID and re-ingest the transformed data into the instance, ensuring its seamless integration.
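The mapping step can be illustrated with a function that rewrites the dataset references of a WPC record before it is re-ingested. This is a sketch under two assumptions: the references live under `data.Datasets`, and a reference may carry a trailing ":" after the ID. References with an explicit version suffix are not handled, and `relink_datasets` is a hypothetical name.

```python
import copy


def relink_datasets(wpc_record: dict, id_map: dict) -> dict:
    """Return a copy of the WPC record with external dataset references replaced.

    id_map maps external (ConnectedSource.Generic) IDs to internal (File.Generic) IDs.
    References that are not in the map are left unchanged.
    """
    updated = copy.deepcopy(wpc_record)  # never mutate the caller's record
    data = updated.setdefault("data", {})
    relinked = []
    for ref in data.get("Datasets", []):
        bare = ref.rstrip(":")
        suffix = ref[len(bare):]  # keeps a trailing ":" if the reference had one
        relinked.append(id_map.get(bare, bare) + suffix)
    data["Datasets"] = relinked
    return updated
```

The returned record can then be submitted for re-ingestion, while the original stays available for comparison or rollback.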
Non-functional Requirements
- Performance:
- The system should be capable of handling large volumes of wellbore data efficiently, providing fast response times for data retrieval and analysis.
- It should be able to handle concurrent user interactions and maintain performance under peak load conditions.
- Scalability:
- The system should be scalable to accommodate increasing amounts of wellbore data and growing user bases.
- It should be able to handle additional data sources and support a high number of concurrent users without significant degradation in performance.