ADR: Dataset as a Core Service
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Changes to be implemented as part of this ADR:
- Rationalize File and Delivery routes into the File service
- Add four new APIs to the File service to address the problems caused by returning a signedURL for a folder. Adding new APIs, rather than changing existing ones, leaves existing implementations undisturbed while resolving the issues that currently affect both File and Delivery.
- Repurpose the Delivery service GitLab repo and CI/CD for the new Dataset service. This addresses the concerns about setting up new CI/CD for the Dataset service.
- Deprecate the methods within File that return a signedURL for a folder, which raise security concerns, with the deletion date set beyond r3. The deprecation will be documented, and an appropriate removal date will be determined later.
- The DD team will review the schemas for the Dataset Registry.
- Certification will not include deprecated methods.
Delivery API moving into File Service
- POST /api/delivery/v2/GetFileSignedUrl -> /api/file/v2/delivery/GetFileSignedUrl
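For consumers updating client code, the route move above can be sketched as a simple path rewrite. This is a minimal illustration that assumes only the single mapping listed in this ADR; the helper name `migrate_route` and the `DELIVERY_TO_FILE_ROUTES` table are hypothetical and are not part of either service.

```python
# Hypothetical client-side helper: maps the old Delivery route to its new
# location under the File service. Only the mapping stated in this ADR is listed.
DELIVERY_TO_FILE_ROUTES = {
    "/api/delivery/v2/GetFileSignedUrl": "/api/file/v2/delivery/GetFileSignedUrl",
}


def migrate_route(path: str) -> str:
    """Return the File service path for a moved Delivery route.

    Paths that are not in the table are returned unchanged.
    """
    return DELIVERY_TO_FILE_ROUTES.get(path, path)


if __name__ == "__main__":
    print(migrate_route("/api/delivery/v2/GetFileSignedUrl"))
```

The HTTP method (POST) is unchanged by the move, so only the path needs rewriting.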
###### Original ADR Below ######
API Documentation
dataset.swagger.yaml OSDU_on_AWS_-_Dataset.pptx
Context & Scope
The Dataset Service introduces a set of APIs to allow for registering, storing, and retrieving datasets.
Today, the capability to store and retrieve Files exists in the File and Delivery services. The Dataset service will make those services obsolete and provide a single API for dataset management across all of the resource types that can be stored or consumed in OSDU. In addition, the Dataset service introduces the concept of a Dataset Registry. A Dataset Registry is a storage record that contains metadata about a Dataset of any type. This metadata includes the information required to retrieve the dataset from its storage location in a cloud-agnostic manner.
In order for datasets to be stored and retrieved, underlying Data Management Services (DMS) must exist to handle resource types. Mappings from Resource Type to DMS are registered in a data store/discovery service and used to look up the DMS that can handle storage/retrieval. If a DMS is registered for a requested Resource Type, it is called to get storage instructions for that Resource Type, or to get retrieval instructions using the information in a Dataset Registry.
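To make the Dataset Registry idea concrete, the record can be pictured roughly as below. This is a sketch under stated assumptions, not the schema: the class name and the fields `id`, `resource_type`, and `dataset_properties` are hypothetical, and the actual definition is in the Dataset Registry schemas.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class DatasetRegistry:
    """Hypothetical shape of a Dataset Registry storage record."""

    # Identifier of the registry record (assumed name).
    id: str
    # Resource type of the dataset (assumed name), e.g. a file or non-file type.
    resource_type: str
    # Metadata needed to retrieve the dataset from its storage location.
    # Left as a free-form mapping here so the record stays cloud agnostic.
    dataset_properties: Dict[str, Any] = field(default_factory=dict)


if __name__ == "__main__":
    record = DatasetRegistry(
        id="example-registry-id",
        resource_type="example-resource-type",
        dataset_properties={"location": "example-location"},
    )
    print(record)
```

Keeping the retrieval metadata opaque to callers is one way to keep cloud-specific details out of the common API.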
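The lookup described above can be sketched as a small dispatch table. This is an illustrative in-memory stand-in, assuming a Dataset Registry is passed around as a plain dictionary; `DmsRegistry`, `example_dms`, and the keys `id` and `resource_type` are hypothetical names, and a real deployment would call registered DMS services over the network.

```python
from typing import Callable, Dict

# A DMS handler takes a Dataset Registry (a dict here) and returns
# retrieval instructions (also a dict). Both shapes are assumptions.
DmsHandler = Callable[[dict], dict]


class DmsRegistry:
    """In-memory stand-in for the data store/discovery service."""

    def __init__(self) -> None:
        self._handlers: Dict[str, DmsHandler] = {}

    def register(self, resource_type: str, handler: DmsHandler) -> None:
        """Register the DMS that handles the given resource type."""
        self._handlers[resource_type] = handler

    def retrieval_instructions(self, dataset_registry: dict) -> dict:
        """Look up the DMS for the record's resource type and call it."""
        resource_type = dataset_registry["resource_type"]
        handler = self._handlers.get(resource_type)
        if handler is None:
            raise LookupError(f"No DMS registered for resource type {resource_type!r}")
        return handler(dataset_registry)


def example_dms(dataset_registry: dict) -> dict:
    """Hypothetical DMS: echoes back which record it would serve."""
    return {"handled_by": "example-dms", "registry_id": dataset_registry["id"]}


if __name__ == "__main__":
    registry = DmsRegistry()
    registry.register("example-type", example_dms)
    print(registry.retrieval_instructions({"id": "r1", "resource_type": "example-type"}))
```

Raising an error for an unregistered resource type is a design choice in this sketch; the text above does not state how that case is handled.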
Decision
Rationalize the File service and Delivery service and replace them with the Dataset Service.
Rationale
- Eliminate the hard dependency on 'File' type that currently exists in Manifests/Work Product/Work Product Components.
- Simplify the API so that all routes related to managing datasets are centralized in a single top-down service structure.
- Replace FileId with DatasetRegistryId in routes/services so that non-file-based datasets can be used cleanly in workflows.
Consequences
Consumers of the existing File and Delivery services will need to migrate to the Dataset Service's API.
Tradeoff Analysis - Input to decision
As we implement Ingestion and Enrichment workflows, a standardized set of APIs for Dataset management, storage, and retrieval, as well as for DMS, will keep the interactions between services from growing too complex. The Dataset Service provides a single API for Applications and End Users that work with Datasets.
Decision criteria and tradeoffs
To accommodate deprecation strategies for File and Delivery, we need to decide whether this service should be introduced as a new standalone service or should replace the existing services and be deployed in their place.
Decision timeline
Decision ready to be made.