[ADR] Synching SDMS V4 datasets in SDMS V3
Introduction
We need a solution that makes datasets ingested in SDMS V4 visible to, and consumable by, SDMS V3.
The purpose of this ADR is to describe a synchronization mechanism that allows users of SDMS V3 to consume seismic dataset entities ingested in SDMS V4, even though the two versions of the system have entirely different architectures.
Status
- Initiated
- Proposed
- Under Review
- Approved
- Rejected
Problem statement
The Seismic Data Management Service V4 (SDMS V4) stores and manages data types as defined by the Open Subsurface Data Universe (OSDU) Authority. The APIs (Application Programming Interfaces) provide robust data type checks and are fully integrated with the OSDU policy service. The goal is to minimize ambiguity in the authorization model and facilitate straightforward adoption through a consistent usage pattern. In contrast, the V3 version of the service defines, saves, and manages proprietary metadata records, interacts directly with the entitlement service, and organizes records into collections/data-groups named subprojects.
The key difference between the two versions of the service lies in the form of the record. The OSDU record adopted by SDMS V4 is entirely managed by the storage service. The V3 metadata, however, has its own format, so locating a dataset ingested in SDMS V4 via V3 requires creating a V3 proprietary record. The following section describes how an OSDU record can be translated into a V3 record to enable the synchronization process between the systems.
Proposed solution
Create a new service capable of detecting when a new dataset is registered in SDMS V4 and creating the corresponding record in SDMS V3.
Overview
As previously noted, in SDMS V3 the dataset descriptor has a proprietary structure and is maintained in an internal catalog, whereas in SDMS V4 the descriptor is a standard OSDU record managed by the storage service. To make a dataset ingested in SDMS V4 visible in SDMS V3, we must create a corresponding V3 metadata record. This section describes how an SDMS V3 record can be created from the OSDU record details.
The SDMS V3 dataset descriptor
{
"id": "the record id <used as key in the service journal catalogue>",
"data": {
"name": "the dataset name",
"tenant": "the tenant name",
"subproject": "the subproject name",
"path": "the dataset virtual folder path",
"acls": {
"admins": "list of entitlement groups with admin rights",
"viewers": "list of entitlement groups with viewer rights"
},
"ltag": "the associated legal tag",
"created-by": "the id of the user who ingested the dataset",
"created_date": "the date and time when the dataset was ingested",
"last_modified_date": "the date and time when the dataset was last modified",
"gcsurl": "the storage uri string where bulks are saved",
"ctag": "a coherency hash tag that changes every time this record is modified",
"readonly": "the access mode level",
"filemetadata": {
"nobjects": "the number of blobs composing the dataset",
"size": "the dataset bulk total size",
"type": "the type of the manifest",
"checksum": "the dataset bulk checksum",
"tier_class": "the dataset storage tier class"
},
"computed_size": "the computed dataset size",
"computed_size_date": "the date and time when the dataset size was computed",
"seismicmeta_guid": "the associated OSDU record id"
}
}
The SDMS V4 record (simplified)
{
"kind": "the osdu dataset kind",
"acl": {
"viewers": "list of entitlement groups with viewer rights",
"owners": "list of entitlement groups with admin rights"
},
"legal": {
"legaltags": "the list of legal tags",
"otherRelevantDataCountries": "the list of data countries",
"status": "the legal status"
},
"data": {
"Name": "the dataset name",
"Description": "the dataset description",
"TotalSize": "the dataset total size",
"DatasetProperties": {
"FileCollectionPath": "the dataset virtual folder path",
"FileSourceInfos": [
{
"FileSource": "the file component source",
"PreloadFilePath": "the file component origin",
"Name": "the file component name",
"FileSize": "the file component size",
"Checksum": "the file component checksum",
"ChecksumAlgorithm": "the checksum algorithm"
}
],
"Checksum": "the dataset checksum"
}
}
}
ADR symbols definitions
To make it simpler for the reader to understand the examples in the following sections, we define the following symbols:
Symbols | Description |
---|---|
RV3 | the SDMS V3 record |
RV4 | the SDMS V4 record |
RV4.DatasetProperties | the record_v4.data.DatasetProperties element |
RV4.FileSourceInfos | the record_v4.data.DatasetProperties.FileSourceInfos element |
The SDMS V3 record generation in detail
- RV3.id
  The ID in SDMS V3 is autogenerated from the values composing the SDMS V3 URI: `tenant`, `subproject`, `path` and `name`.

      import hashlib

      # Hash the virtual path and dataset name, then prefix with tenant and subproject
      hash_obj = hashlib.sha512()
      hash_obj.update((RV3.data.path + RV3.data.name).encode('utf-8'))
      hashed_value = hash_obj.hexdigest()
      RV3.id = 'ds-' + RV3.data.tenant + '-' + RV3.data.subproject + '-' + hashed_value

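- RV3.data.name
  The dataset name. It is taken from the V4 record name when present; otherwise from the name of the file component, if there is exactly one; otherwise the V4 record ID is used.

      if 'Name' in RV4.data:
          RV3.data.name = RV4.data.Name
      elif len(RV4.FileSourceInfos) == 1 and 'Name' in RV4.FileSourceInfos[0]:
          RV3.data.name = RV4.FileSourceInfos[0].Name
      else:
          RV3.data.name = RV4.id
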
- RV3.data.tenant
  The dataset tenant name matches the data-partition-id in the OSDU model. This information cannot be derived automatically from the V4 record, but it is readily available to the syncing process.

      RV3.data.tenant = data_partition_id

- RV3.data.subproject
  The dataset resource group name (referred to as subproject in SDMS V3) must exist in SDMS V3 with the `access_policy` property set to `dataset`. Essentially, each partition in SDMS V3 should have a default data group where all SDMS V4 datasets can be collected. This required data group can be automatically created by the syncing process. The name of the data group will default to `syncv4`.

      RV3.data.subproject = "syncv4"

- RV3.data.path
  The dataset virtual path represents the logical folder structure in the data group (subproject) where the dataset is stored.

      RV3.data.path = RV4.DatasetProperties.FileCollectionPath

- RV3.data.acls
  The Access Control List (ACL) defines the entitlement groups with admin and viewer rights. The two records differ in naming only: the element is called `acl` in V4 and `acls` in V3, and the V4 `owners` list is named `admins` in the SDMS V3 record, while the `viewers` list has the same name in both.

      RV3.data.acls.admins = RV4.acl.owners
      RV3.data.acls.viewers = RV4.acl.viewers

- RV3.data.ltag
  In SDMS V3, legal tag information is represented by a single value, whereas in SDMS V4 it is represented as a list. To simplify the record composition, we select the first valid legal tag from the V4 record list. If no valid legal tag is found in the V4 record, we must still set a legal tag in V3, and it will be an invalid one. If the field is left unset, V3 inherits a valid legal tag from the data group, with the risk that a record that is not accessible in V4 becomes addressable in V3.

      RV3.data.ltag = None
      for tag in RV4.legal.legaltags:
          if isValid(tag):
              RV3.data.ltag = tag
              break
      if RV3.data.ltag is None:
          # No valid tag found: deliberately keep an invalid one
          RV3.data.ltag = RV4.legal.legaltags[0]

- RV3.data.created-by
  The user who created/ingested the dataset.

      RV3.data['created-by'] = RV4.createUser

- RV3.data.created_date
  The timestamp when the dataset was created/ingested.

      RV3.data.created_date = RV4.createTime

- RV3.data.last_modified_date
  The timestamp when the dataset was last modified.

      RV3.data.last_modified_date = RV4.modifyTime

- RV3.data.gcsurl
  The storage ID of the container/bucket where dataset bulk files are stored. This value is automatically generated based on the record ID value.

      import hashlib

      hash_obj = hashlib.sha256()
      hash_obj.update(RV4.id.encode('utf-8'))
      # Drop the last character: 63 of the 64 hex characters are kept
      RV3.data.gcsurl = hash_obj.hexdigest()[:-1]

- RV3.data.ctag
  The Coherency Tag (ctag) is a code associated with the dataset descriptor that changes every time the metadata is updated. This property exists only in SDMS V3, and it is autogenerated; here it is a random 16-character alphanumeric string.

      import secrets
      import string

      alphabet = string.ascii_letters + string.digits
      RV3.data.ctag = ''.join(secrets.choice(alphabet) for _ in range(16))

- RV3.data.readonly
  The `readonly` property defines whether the dataset can be written. If set to `false`, the dataset can be accessed in both read and write modes. If set to `true`, the dataset can only be accessed in read mode. In SDMS V4, a dataset cannot be marked as `readonly`; for this reason, the value defaults to `false` in the generated V3 record.

      RV3.data.readonly = False

- RV3.data.filemetadata
  The `filemetadata`, also known as the dataset manifest, is an object containing information about how the dataset's bulks are stored in the cloud storage resource. The only supported manifest in SDMS V3 is `GENERIC`, which requires that all objects composing the dataset be saved in sequential order using the `0` to `N-1` naming convention, where `N` is the number of objects. The fields composing the dataset manifest are:
  - `nobjects`: the number of objects composing the dataset. This value can be computed by counting the objects composing the dataset.
  - `size`: the dataset total size, computed by summing the sizes of all objects composing the dataset. Alternatively, `RV4.data.TotalSize` can be used if it exists, but computing the size is preferred.
  - `type`: the manifest type; `GENERIC` is the only supported value.
  - `checksum`: the dataset checksum.
  - `tier_class`: the dataset storage tiering class.

      # getBlobClient is a placeholder that lists the dataset objects,
      # assumed to be returned in numeric name order
      blob_list = getBlobClient(connectionString)
      size = 0
      tier_class = None
      objects_num = 0
      error = False
      for blob in blob_list:
          # Objects must be named 0, 1, ..., N-1
          if blob.name != str(objects_num):
              error = True
          if tier_class is None:
              tier_class = blob.blob_tier
          objects_num = objects_num + 1
          size = size + blob.size
      if not error:
          RV3.data.filemetadata.type = 'GENERIC'
          RV3.data.filemetadata.nobjects = objects_num
          RV3.data.filemetadata.size = size
          if 'Checksum' in RV4.DatasetProperties:
              RV3.data.filemetadata.checksum = RV4.DatasetProperties.Checksum
          RV3.data.filemetadata.tier_class = tier_class
      else:
          RV3.data.filemetadata = None

- RV3.data.computed_size
  The `computed_size` is generated by SDMS V3 when the `/size` endpoint is triggered. This endpoint calculates the size of the dataset by summing the sizes of all composing objects. This field was introduced because the dataset `filemetadata` object is an optional field created by client applications, such as sdapi or sdutil, and can only be trusted by them.

      blob_list = getBlobClient(connectionString)
      size = 0
      for blob in blob_list:
          size = size + blob.size
      RV3.data.computed_size = size

- RV3.data.computed_size_date
  This is the timestamp of when the dataset size was computed by SDMS V3.

      RV3.data.computed_size_date = str(datetime.datetime.now())

- RV3.data.seismicmeta_guid
  The `seismicmeta_guid` is the ID of a record linked with the SDMS V3 dataset. It can be set to the SDMS V4 record ID so that consumer applications can download all extra properties.

      RV3.data.seismicmeta_guid = RV4.id

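The field-by-field rules above can be combined into a single conversion function. The following is a minimal sketch, not the sync service implementation: it works on plain dictionaries, and the names `convert_v4_to_v3`, `build_manifest`, `BlobInfo` and the `is_valid` callback are hypothetical and introduced only for illustration. It also assumes that the legal tag list is non-empty and that the object listing is passed in as simple tuples rather than read from cloud storage.

```python
import datetime
import hashlib
import secrets
import string
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical listing entry for one stored object: (name, size in bytes, tier)
BlobInfo = Tuple[str, int, str]


def build_manifest(blobs: List[BlobInfo], checksum: Optional[str]) -> Optional[Dict[str, Any]]:
    """Return a GENERIC manifest, or None if the objects are not named 0..N-1."""
    expected_names = {str(i) for i in range(len(blobs))}
    # Assumption: an empty listing is treated as "manifest not computable"
    if not blobs or {name for name, _, _ in blobs} != expected_names:
        return None
    manifest: Dict[str, Any] = {
        "type": "GENERIC",
        "nobjects": len(blobs),
        "size": sum(size for _, size, _ in blobs),
        "tier_class": blobs[0][2],
    }
    if checksum is not None:
        manifest["checksum"] = checksum
    return manifest


def convert_v4_to_v3(
    rv4: Dict[str, Any],
    partition: str,
    blobs: List[BlobInfo],
    is_valid: Callable[[str], bool],
    subproject: str = "syncv4",
) -> Dict[str, Any]:
    """Build an SDMS V3 descriptor from an SDMS V4 (OSDU) record."""
    data = rv4["data"]
    props = data.get("DatasetProperties", {})
    infos = props.get("FileSourceInfos", [])
    # Name: record name, else the single file component name, else the record id
    if "Name" in data:
        name = data["Name"]
    elif len(infos) == 1 and "Name" in infos[0]:
        name = infos[0]["Name"]
    else:
        name = rv4["id"]
    path = props["FileCollectionPath"]
    # First valid legal tag, else the first (invalid) tag; list assumed non-empty
    tags = rv4["legal"]["legaltags"]
    ltag = next((tag for tag in tags if is_valid(tag)), tags[0])
    path_hash = hashlib.sha512((path + name).encode("utf-8")).hexdigest()
    alphabet = string.ascii_letters + string.digits
    return {
        "id": f"ds-{partition}-{subproject}-{path_hash}",
        "data": {
            "name": name,
            "tenant": partition,
            "subproject": subproject,
            "path": path,
            "acls": {"admins": rv4["acl"]["owners"], "viewers": rv4["acl"]["viewers"]},
            "ltag": ltag,
            "created-by": rv4["createUser"],
            "created_date": rv4["createTime"],
            "last_modified_date": rv4["modifyTime"],
            "gcsurl": hashlib.sha256(rv4["id"].encode("utf-8")).hexdigest()[:-1],
            "ctag": "".join(secrets.choice(alphabet) for _ in range(16)),
            "readonly": False,
            "filemetadata": build_manifest(blobs, props.get("Checksum")),
            "computed_size": sum(size for _, size, _ in blobs),
            "computed_size_date": str(datetime.datetime.now()),
            "seismicmeta_guid": rv4["id"],
        },
    }
```

Unlike the per-field snippet, `build_manifest` compares the object names as a set, so the result does not depend on the order in which the storage listing is returned. Because the `id` depends only on tenant, subproject, path and name, converting the same dataset twice yields the same key.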
The Script to validate the proposed conversion
The script sync-script.py is provided with this ADR (for testing purposes only) to demonstrate and validate the synching flow between SDMS V4 and V3:
- Create a random data file of 16MB and compute its checksum
- Fill an OSDU record and register it in SDMS V4
- Upload the 16MB file as 4 objects of 4MB each using the connection string generated via SDMS V4
- Generate a V3 metadata record and register it in SDMS V3
- Ensure the dataset in SDMS V3 can be located after ingestion
- Download all objects using the connection string generated via SDMS V3
- Compare the initial object with the downloaded one to ensure they match
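The split, reassemble and compare steps of this flow can be illustrated without any service calls. The sketch below is not sync-script.py; it is an in-memory stand-in in which a dictionary plays the role of the storage container, and the function names `split_into_objects` and `reassemble` are hypothetical. MD5 is used only because it is the checksum algorithm shown in the example record.

```python
import hashlib
import os
from typing import Dict


def split_into_objects(data: bytes, object_size: int) -> Dict[str, bytes]:
    """Split data into chunks named "0", "1", ... (the GENERIC naming convention)."""
    return {
        str(index): data[offset:offset + object_size]
        for index, offset in enumerate(range(0, len(data), object_size))
    }


def reassemble(objects: Dict[str, bytes]) -> bytes:
    """Concatenate the objects in numeric name order."""
    return b"".join(objects[str(i)] for i in range(len(objects)))


if __name__ == "__main__":
    # 16MB of random data, stored as 4 objects of 4MB each
    original = os.urandom(16 * 1024 * 1024)
    objects = split_into_objects(original, 4 * 1024 * 1024)
    restored = reassemble(objects)
    assert len(objects) == 4
    assert hashlib.md5(original).hexdigest() == hashlib.md5(restored).hexdigest()
```

Reading the objects back by numeric name, rather than in dictionary or listing order, is what keeps the reassembled file identical to the original.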
Example of an SDMS V4 ingested record
{
"id": "opendes:dataset--FileCollection.SEGY:7fe06451787641c4953a06a63e44967a",
"kind": "osdu:wks:dataset--FileCollection.SEGY:1.1.0",
"version": 1694519237996696,
"acl": {
"viewers": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.viewer@opendes.domain.com"
],
"owners": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.admin@opendes.domain.com"
]
},
"legal": {
"legaltags": [
"ltag-seistore-test-01"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"modifyUser": "test-user@domain.com",
"modifyTime": "2023-09-07T11:47:18.625Z",
"createUser": "test-user@domain.com",
"createTime": "2023-09-07T07:17:58.443Z",
"data": {
"Name": "data-sync.segy",
"TotalSize": "16777216",
"Description": "SDMS synching test record",
"DatasetProperties": {
"FileCollectionPath": "/f1/f2/f3/",
"FileSourceInfos": [
{
"FileSource": "data-sync.segy",
"Name": "data-sync.segy",
"FileSize": "16777216",
"Checksum": "8ce2025f9b27e3017ab15f15b261d599",
"ChecksumAlgorithm": "MD5"
}
],
"Checksum": "8ce2025f9b27e3017ab15f15b261d599"
}
}
}
Example of a generated SDMS V3 metadata
{
"id": "ds-opendes-syncv4-c0699ac77bc64a5772ac7f6f455ce5a251e3686d87d26e91df2ecc73e7bfdf4b0a16ac757c2ec227c1a6814d097a0b6b759a01dc52753754a0a18dfaea53c7d0",
"data": {
"name": "data-sync.segy",
"tenant": "opendes",
"subproject": "syncv4",
"path": "/f1/f2/f3/",
"acls": {
"admins": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.admin@opendes.domain.com"
],
"viewers": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.viewer@opendes.domain.com"
]
},
"ltag": "ltag-seistore-test-01",
"created-by": "test-user@domain.com",
"created_date": "2023-09-07T07:17:58.443Z",
"last_modified_date": "2023-09-07T11:47:18.625Z",
"gcsurl": "a5993feef91df715c176452fe1a26d04ca70e88d0ccff268e92cd74c76dde61",
"ctag": "9STTAfiKl4iukKbp",
"readonly": "false",
"filemetadata": {
"nobjects": 4,
"size": 16777216,
"type": "GENERIC",
"checksum": "8ce2025f9b27e3017ab15f15b261d599",
"tier_class": "Hot"
},
"computed_size": 16777216,
"computed_size_date": "2023-09-12 13:47:45.877142",
"seismicmeta_guid": "opendes:dataset--FileCollection.SEGY:7fe06451787641c4953a06a63e44967a"
}
}
SDMS V4 to V3 Synching Automation
The preceding section explains the process of creating a metadata descriptor for SDMS V3 using an OSDU record. This metadata descriptor enables access to a dataset ingested in SDMS V4 through SDMS V3.
To automate the process, we will deploy a new service called `sdms-sync-service`, which will be responsible for generating an SDMS V3 record every time a new dataset is registered in SDMS V4. When a dataset is registered in SDMS V4, a message `insert-synch-v4:{record-id}:{partition}:{other-required-params}` will be pushed into a Redis queue. The new service will consume the messages from the Redis queue and initiate the synching process:
- retrieve the OSDU record from the storage service
- generate the corresponding SDMS V3 metadata descriptor
- save the generated metadata in the SDMS V3 journal
Details
- If a dataset is patched in SDMS V4, the service should push an `insert` message into the Redis queue:
  - If the previous `insert` message is still in the queue (not yet consumed by the sync service), the existing entry will be overwritten in the queue, and the sync service will create the record from the updated one.
  - If the previous version was already synced, the updated record will be created when the new message is consumed and, because the generated key is identical, it will overwrite the existing record in the journal.
- If a dataset is deleted in SDMS V4, the service should push a `delete` message into the Redis queue:
  - When the `delete` message is consumed, the sync service will generate only the V3 record key and remove the entry from the journal.
  - If the `insert` message has not yet been consumed from the queue, the sync service should check, when it consumes it, whether a `delete` message is also present for the same record. If one is found in the queue, the sync service will skip the sync process and remove both the `insert` and `delete` entries from the Redis queue.
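The message handling rules above can be modelled with a small in-memory structure. This is only a sketch of the intended semantics: it does not use Redis, and the class `PendingSyncMessages`, its methods and the action names are hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


class PendingSyncMessages:
    """In-memory stand-in for the Redis queue described above (illustrative only)."""

    def __init__(self) -> None:
        self.inserts: Dict[str, dict] = {}  # record id -> latest insert payload
        self.deletes: Set[str] = set()      # record ids with a pending delete

    def push_insert(self, record_id: str, payload: dict) -> None:
        # A newer insert for the same record overwrites the pending one
        self.inserts[record_id] = payload

    def push_delete(self, record_id: str) -> None:
        self.deletes.add(record_id)

    def consume(self, record_id: str) -> Tuple[str, Optional[dict]]:
        """Remove the pending messages for a record and return the action to take."""
        payload = self.inserts.pop(record_id, None)
        has_delete = record_id in self.deletes
        self.deletes.discard(record_id)
        if payload is not None and has_delete:
            # Insert and delete both pending: skip the sync, drop both entries
            return ("skip", None)
        if payload is not None:
            # Create the V3 record, or overwrite it (the generated key is identical)
            return ("upsert", payload)
        if has_delete:
            # Generate only the V3 key and remove the journal entry
            return ("delete", None)
        return ("none", None)
```

The sketch follows the rules exactly as written. Whether the skip case should also remove a journal entry created by an earlier sync of the same record is not covered by the text above.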
Limitations
When a dataset is registered in V4 via a client app, the record is created instantaneously, while uploading the bulk data into the storage resource takes longer. If the `insert` message is consumed before the bulk data is uploaded, the file manifest cannot be computed because objects are missing. To address this issue, we can enable a background process in the `sync-service` that loops over the created SDMS V3 records and updates the manifest when it does not exist or when the last modified time in the corresponding SDMS V4 record is later than the one reported in the V3 entry. This approach should be re-discussed with the community to find an optimal strategy.
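The refresh condition used by such a background process could look like the following. This is a sketch under assumptions, not an agreed design: the function names are hypothetical, records are plain dictionaries, timestamps are assumed to use the ISO 8601 form shown in the example records, and a V4 record without `modifyTime` is assumed not to require a refresh.

```python
from datetime import datetime
from typing import Any, Dict


def parse_utc(timestamp: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2023-09-07T11:47:18.625Z'."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def needs_manifest_refresh(rv3: Dict[str, Any], rv4: Dict[str, Any]) -> bool:
    """Return True if the V3 manifest is missing or older than the V4 record."""
    data = rv3["data"]
    if not data.get("filemetadata"):
        return True
    modified_v4 = rv4.get("modifyTime")
    if modified_v4 is None:
        # Assumption: without a V4 modification time there is nothing newer to pick up
        return False
    return parse_utc(modified_v4) > parse_utc(data["last_modified_date"])
```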