[ADR] seismic storage tiers support
Introduction
This ADR proposes how support for multiple storage tiers should be enabled in SDMS to better manage storage costs.
Status
- Initiated
- Proposed
- Under Review
- Approved
- Rejected
SDMS dataset concepts and ingestion overview
A dataset resource in SDMS is identified by the following URI string:
sd://tenant/subproject/path/dataset
where:
- tenant: the unique data-partition-id.
- subproject: the name of the data group.
- path: a virtual path in the subproject (a folder tree).
- dataset: the name of the dataset.
For example, in the dataset sd://opendes/sandbox/processing/2023/result.zgy:
- tenant = opendes
- subproject = sandbox
- path = /processing/2023/
- dataset = result.zgy
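For illustration, a minimal sketch (not part of SDAPI) of how such a URI decomposes into its components:
// illustrative only: split a dataset URI into tenant, subproject, path and dataset
#include <iostream>
#include <string>
int main() {
    const std::string uri = "sd://opendes/sandbox/processing/2023/result.zgy";
    const std::string body = uri.substr(std::string("sd://").size());
    const auto tenantEnd = body.find('/');
    const auto subprojectEnd = body.find('/', tenantEnd + 1);
    const auto datasetStart = body.rfind('/');
    std::cout << "tenant     = " << body.substr(0, tenantEnd) << "\n";                                    // opendes
    std::cout << "subproject = " << body.substr(tenantEnd + 1, subprojectEnd - tenantEnd - 1) << "\n";    // sandbox
    std::cout << "path       = " << body.substr(subprojectEnd, datasetStart - subprojectEnd + 1) << "\n"; // /processing/2023/
    std::cout << "dataset    = " << body.substr(datasetStart + 1) << "\n";                                // result.zgy
    return 0;
}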
A dataset in SDMS is composed of a metadata descriptor, maintained in the SDMS db catalogue, and a set of objects saved in a cloud storage resource. Through SDMS, a dataset is always seen as a single entity even if its data content has been split into multiple objects.
In general, SDMS has a dedicated storage resource per partition in which all objects composing datasets are stored. The objects composing a dataset are saved into the storage account in different ways based on the storage policy applied at data group (subproject) level:
- subproject access policy = "uniform": access is granted at data group level. A subproject writer can write and read, and a reader can read, any dataset in the subproject. For each subproject a dedicated storage resource is created and all objects composing a dataset are saved under a virtual folder path. For example, in Azure, a dataset is saved into storage-account(per partition)\container(per subproject)\virtual-folder(per dataset)\object_0...object_N.
- subproject access policy = "dataset": access is granted at dataset level. A dataset writer can write and read, and a reader can read, only the datasets they have been granted access to. For each dataset a dedicated storage resource is created and all objects composing the dataset are saved under a virtual folder path. For example, in Azure, a dataset is saved into storage-account(per partition)\container(per dataset)\virtual-folder(per dataset)\object_0...object_N.
When a dataset is uploaded to SDMS, the following ingestion flow is executed:
- The client registers the dataset in SDMS. SDMS creates a dataset descriptor in the internal catalogue and reserves a storage area where the dataset's composing objects will be uploaded.
- SDMS returns the descriptor metadata to the client.
- The client requests a connection string from SDMS for the reserved storage resource.
- SDMS returns the generated connection string to the client.
- The client splits the dataset into multiple objects and uploads them to the reserved storage resource.
- The client requests SDMS to close the dataset.
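From a client application point of view these steps are wrapped by SDAPI; a minimal sketch follows (mirroring the SDAPI examples later in this document, where the manager instance and the data/size buffers are assumed to be already set up; registration and connection string handling are performed internally by the library):
// minimal SDAPI ingestion sketch: open() registers the dataset and reserves storage,
// write() uploads the composing objects, close() finalizes the dataset in the catalogue
SDGenericDataset dataset(&manager, "sd://tenant/subproject/path/dataset");
dataset.open(SDDatasetDisposition::CREATE|OVERRIDE);
dataset.write("object_0", data, size);   // repeated for object_1 ... object_N
dataset.close();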
Storage Tiers and SDK support
To provide a cost-effective solution, SDMS must enable storage tier management features in order to save a dataset's composing objects to a specific storage tier class. For example, in Azure, the supported tier classes are Hot, Cold, and Archive. If data objects can be saved into a Cold tier instead of a Hot one, a cost saving is generated for clients.
An object's tier can be set or updated directly by calling cloud storage methods when an object is uploaded or manipulated.
These operations are executed from client applications through CSP-provided SDKs. The SDMS suite offers two client tools: SDAPI, a C++ client library, and SDUTIL, a Python command line utility.
This ADR presents how the SDMS service and the provided client tools, SDAPI and SDUTIL, should be enhanced to support object tiering features.
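For readability of the examples below, the tier argument is assumed to be an enumeration along these lines (illustrative only; the exact definition is part of the proposed SDAPI change, and other CSPs would map their own tier classes):
// sketch of the storage tier class enumeration used in the examples below (illustrative)
enum class Tier {
    Default,  // provider default (Hot on Azure)
    Hot,
    Cold,
    Archive
};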
Set storage tier class
This feature enables consumers to set the desired storage tier class when objects are uploaded.
In SDAPI, we will add a storage tier class argument to the generic dataset opening method. This will be used as the storage tier class when the dataset's objects are uploaded. If not set, the default storage tier class will be used (Hot for Azure). In addition, the tiering argument will be added to both the dataset and utility upload methods provided to ingest a local dataset into SDMS as a single object.
// open a dataset specifying the storage tier class.
SDGenericDataset dataset(&manager, "sd://tenant/subproject/path/dataset");
dataset.open(SDDatasetDisposition::CREATE|OVERRIDE, {
{ api::json::Constants::tier, Tier::<tier-class>}});
// object will be uploaded with the dataset specified <tier-class>
dataset.write("object_name", data, size);
// save the storage tier information in the manifest
dataset.close();
// upload a dataset: generic dataset class
SDGenericDataset dataset(&manager, "sd://tenant/subproject/path/dataset");
dataset.upload("fileToUploadPath", Tier::Cold);
// save the storage tier information in the manifest
dataset.close();
// upload a dataset: utility class
SDUtils utils(&manager);
utils.upload("sd://tenant/subproject/path/dataset", "fileToUploadPath", Tier::Cold);
If the dataset already exists and is opened with a READ_WRITE disposition, the tier should be set to the one specified in the manifest. If this is not present, the default one should be applied.
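A minimal sketch of this READ_WRITE case (no tier argument is passed at open time; the manager and data/size buffers are assumed to be set up as in the examples above):
// re-open an existing dataset for update: the tier is resolved from the dataset manifest
SDGenericDataset dataset(&manager, "sd://tenant/subproject/path/dataset");
dataset.open(SDDatasetDisposition::READ_WRITE);
// objects are uploaded with the manifest tier, or the default one (Hot for Azure) if not present
dataset.write("object_name", data, size);
dataset.close();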
The SDUTIL utility does not provide methods to manipulate single objects. Datasets are uploaded to SDMS via the cp command, which automatically splits the dataset into multiple objects and uploads them to the storage resource. The tier class can be specified in the upload version of the cp command. All of the dataset's composing objects will be uploaded to the tier class specified in the cp command.
sdutil cp data sd://tenant/subproject/path/data --tier="<tier_class>"
In both SDUTIL and SDAPI, when a dataset is closed (at the end of an upload operation), the storage tier class value must be set in the dataset manifest in order to make the tier information available to and trusted by ingestion and consuming applications.
manifest: {
type: "most of the times set to \"GENERIC\"",
nobjects: "number of dataset's objects",
size: "the dataset size",
checksum: "the dataset checksum",
tier_class: "the storage tier class"
}
Please note that SDMS is in control of changes made by SDAPI and SDUTIL applications only. If other apps are used to change the dataset objects' storage tier class, these must also update the dataset manifest (by calling the PATCH /dataset endpoint).
Update storage tier class
This feature enables consumers to update the desired storage tier class of an already uploaded object.
In SDAPI, we will add a new method to the generic dataset and utility classes to update the storage tier class of a dataset's objects.
// update a dataset tier class: generic dataset class
SDGenericDataset dataset(&manager, "sd://tenant/subproject/path/dataset");
dataset.open(SDDatasetDisposition::READ_WRITE);
// update the storage class tier of all dataset's objects
dataset.update(Tier::<tier-class>);
// save the storage tier information in the manifest
dataset.close();
// update a dataset tier class: utility class
SDUtils utils(&manager);
utils.update("sd://tenant/subproject/path/dataset", Tier::Cold);
In SDUTIL, we will update the patch command to update the storage tier class of all the dataset's objects:
sdutil patch sd://tenant/subproject/path/dataset --tier=<tier_class>
Retrieve storage tier class
To know the dataset's storage tier class, client applications can retrieve the dataset descriptor and read the content of the associated value in the manifest. Both SDAPI and SDUTIL should be updated to expose the new value. In SDAPI the dataset model will be updated by adding the extra property, and in SDUTIL we will enhance the stat command by adding the tier class information to the detailed command output:
- Name: sd://test-partition/sandbox/cube.zgy
- Created By: dmolteni3@.com
- Created Date: Tue May 16 2023 11:16:08 GMT+0000 (Coordinated Universal Time)
- Size: 36.0 MB
- No of Objects: 2
- Legal Tag: test-partition-default-legal
- Storage Tier Class: Hot
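On the SDAPI side, a minimal sketch of reading the new property from the dataset model (the accessor name getTier is hypothetical and used for illustration only; the actual name will be defined when the dataset model is extended):
// read the storage tier class exposed by the updated dataset model
SDGenericDataset dataset(&manager, "sd://tenant/subproject/path/dataset");
dataset.open(SDDatasetDisposition::READ_ONLY);
std::string tier = dataset.getTier();  // hypothetical accessor, e.g. returns "Hot", "Cold" or "Archive"
dataset.close();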