Dataset Registry as a Core Service
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
The Dataset Registry is a convenience service that sits above, and works in conjunction with, the Storage service.
Today we have a "File" object, embedded in the Manifest, that is used to create the metadata for that File. However, when we tried to describe a Dataset that was not a file (e.g. OpenVDS and External Data), we realized that the File object was not the right way to describe a Dataset. In fact, for OpenVDS (as released for R2) we had to overload the File object in order to make it work. This is not the right way to solve the problem.
In addition, the Ingestion and Ingestion Workflow services have also taken on File-specific semantics; they require a FileID in many cases. This will not scale beyond working with a single File.
The solution is simple: we add a concept called Dataset and an object called Dataset Registry to the architecture, and provide a simple mechanism for managing the life-cycle of Dataset Registry records. The Dataset Registry is a service that adds a semantic layer on top of the existing Storage service. It supports different types of Datasets (e.g. File, OpenVDS, External, Binary, etc.), each described by an object that defines the attributes for that type, which makes it possible to define more generic data workflows based on the type of Dataset you are dealing with. For a more detailed description, please see the Webex recordings.
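To make the idea concrete, a Dataset Registry record for a file-backed Dataset might look roughly like the sketch below. Every kind string and field name here is illustrative, not the released schema; the service code in GitLab is authoritative.

```python
# Illustrative shape only; the actual Dataset Registry schema is authoritative.
file_dataset = {
    "kind": "opendes:osdu:dataset-registry:0.0.1",  # hypothetical kind string
    "data": {
        "DatasetTypeID": "File",  # other types: OpenVDS, External, Binary, ...
        "DatasetProperties": {
            # Type-specific attributes live here; for a File, that is mainly
            # where the bytes are stored.
            "fileSourceInfo": {"fileSource": "/osdu-user/uploads/wells.csv"},
        },
    },
}
```

Because the Dataset type is an explicit attribute of the record, a workflow can branch on it instead of assuming everything is a File.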
Key Use Cases
CSV
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the CSV file using the cloud provider SDK for their object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the CSV data using the Data Workflow service (code available in GitLab), passing the Dataset Registry Record ID for the CSV file as an input parameter; this executes the “CSV ingestion DAG”. That DAG is not there yet, but with a little help from SLB I think we can get it there. (See the sketch after this list.)
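A minimal sketch of this flow in Python, assuming hypothetical endpoint paths, payload shapes, and field names (the actual File DMS, Dataset Registry, and Data Workflow APIs in GitLab are the source of truth):

```python
import requests

OSDU_BASE = "https://osdu.example.com/api"  # hypothetical host
HEADERS = {"Authorization": "Bearer <token>", "data-partition-id": "opendes"}

def upload_and_register(local_path: str) -> str:
    """Upload one file and register it as a Dataset; returns the record ID."""
    # 1. Ask the File DMS for a signed upload location for the cloud provider.
    loc = requests.get(f"{OSDU_BASE}/file/v1/getLocation", headers=HEADERS).json()

    # 2. Store the file in the provider's object storage. A plain PUT to the
    #    signed URL stands in for the cloud provider SDK here.
    with open(local_path, "rb") as f:
        requests.put(loc["SignedURL"], data=f)

    # 3. Register the Dataset and return the Dataset Registry Record ID.
    reg = requests.post(
        f"{OSDU_BASE}/dataset-registry/v1/registerDataset",
        headers=HEADERS,
        json={"datasetProperties": {"fileSourceInfo": {"fileSource": loc["FileID"]}}},
    ).json()
    return reg["recordIds"][0]

def start_workflow(dag_name: str, context: dict) -> None:
    """Trigger an ingestion DAG via the Data Workflow service."""
    requests.post(
        f"{OSDU_BASE}/workflow/v1/startWorkflow",
        headers=HEADERS,
        json={"dagName": dag_name, "executionContext": context},
    )

# CSV: one Dataset, one DAG run, keyed by the record ID rather than a FileID.
csv_id = upload_and_register("wells.csv")
start_workflow("csv_ingestion", {"datasetRegistryId": csv_id})
```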
WITSML
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the WITSML file using the cloud provider SDK for their object storage
- Register the Dataset (for the File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the WITSML data using the Data Workflow service (code available in GitLab), passing the Dataset Registry Record ID for the WITSML file as an input parameter; this executes the “WITSML ingestion DAG”. That DAG is not there yet, but with a little help from Energistics I think we can get it there. (See the snippet after this list.)
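The WITSML flow is identical to the CSV one except for the file and the target DAG; reusing the hypothetical helpers from the CSV sketch above (the DAG name is again an assumption):

```python
# Same hypothetical helpers as in the CSV sketch; only the inputs change.
witsml_id = upload_and_register("well_log.xml")
start_workflow("witsml_ingestion", {"datasetRegistryId": witsml_id})
```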
SegY
- Use the File DMS (code available in GitLab) to get the upload location object for the cloud provider
- Store the SegY file using the cloud provider SDK for their object storage
- Register the Dataset (for the SegY File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- Store the Manifest file (describing the SegY file) using the cloud provider SDK for their object storage
- Register the Dataset (for the Manifest File) using the Dataset Registry service (code available in GitLab), getting back the Dataset Registry Record ID
- “Ingest” the SegY data using the Data Workflow service (code available in GitLab); this executes the “SegY Manifest ingestion DAG”, which we already have, based on the EPAM data loader script. (See the sketch after this list.)
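The SegY case is the first one involving two Datasets per ingestion. A sketch, again reusing the hypothetical helpers from the CSV example and assuming the DAG accepts both record IDs in its execution context:

```python
# SegY: the bulk file and the Manifest describing it are registered as two
# separate Datasets; both record IDs are handed to the Manifest ingestion DAG.
segy_id = upload_and_register("survey.segy")
manifest_id = upload_and_register("survey_manifest.json")
start_workflow("segy_manifest_ingestion",
               {"datasetRegistryIds": [segy_id, manifest_id]})
```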
Decision
Introduce the new Dataset Registry service as a core service
Rationale
This eliminates two issues we currently have: 1/ the Manifest today specifies the metadata for a "File Record", which does not work for other types of Datasets (e.g. OpenVDS); and 2/ many services are forced to use a FileID, which requires separate life-cycle management. We can now use the Dataset Registry record ID instead of the FileID.
The Dataset Registry is proposed as a Core service because it affects every phase of data that is registered and managed by the data platform. This resolves the inconsistencies between the ingestion and delivery phases.
Consequences
No impact on current services. We can then migrate to eliminate the File object in the Manifest, as well as the use of FileID in other services.
Tradeoff Analysis - Input to decision
As we move further into R3 and the various DMS and DDMS components, this problem will become harder to resolve. We need to act now in order to avoid significant rework in the future.
Decision criteria and tradeoffs
The decision is simple: act to resolve the issue identified, or do nothing.
Decision timeline
Code is already in open source.