
[ADR] Synching SDMS V4 datasets in SDMS V3

Introduction

We need a solution to make datasets ingested in SDMS V4 visible to, and consumable by, SDMS V3.

The purpose of this ADR is to describe how to enable a synchronization mechanism that allows users of SDMS V3 to consume seismic dataset entities ingested in SDMS V4, even though the two versions of the system have entirely different architectures.

Status

  • Initiated
  • Proposed
  • Under Review
  • Approved
  • Rejected

Problem statement

The Seismic Data Management Service V4 (SDMS V4) stores and manages data types as defined by the Open Subsurface Data Universe (OSDU) Authority. The APIs (Application Programming Interfaces) provide robust data type checks and are fully integrated with the OSDU policy service. The goal is to minimize ambiguity in the authorization model and facilitate straightforward adoption through a consistent usage pattern. In contrast, the V3 version of the service defines, saves, and manages proprietary metadata records, interacts directly with the entitlement service, and organizes records into collections/data-groups named subprojects.


[Figure: SDMS architectural diagram]

The key difference between the two versions of the service lies in the form of the record. The OSDU record adopted by SDMS V4 is entirely managed by the storage service, whereas the V3 metadata has its own format; to locate a dataset ingested in SDMS V4 via V3, a proprietary V3 record must be created. The following section describes how an OSDU record can be translated into a V3 record to enable the synchronization process between the two systems.

Proposed solution

Create a new service capable of detecting when a new dataset is registered in SDMS V4 and creating the corresponding record in SDMS V3.

Overview

As previously noted, in SDMS V3 the dataset descriptor has a proprietary structure and is maintained in an internal catalog, whereas in SDMS V4 the descriptor is a standard OSDU record managed by the storage service. To make a dataset ingested in SDMS V4 visible in SDMS V3, we must create a corresponding V3 metadata record. This section describes how an SDMS V3 record can be created, using the OSDU record details, to make a dataset ingested in V4 visible in V3.

The SDMS V3 dataset descriptor

{
    "id": "the record id <used as key in the service journal catalogue>",
    "data": {
        "name": "the dataset name",
        "tenant": "the tenant name",
        "subproject": "the subproject name",
        "path": "the dataset virtual folder path",
        "acls": {
            "admins": "list of entitlement groups with admin rights",
            "viewers": "list of entitlement groups with viewer rights"
        },
        "ltag": "the associated legal tag",
        "created-by": "the id of the user who ingested the dataset",
        "created_date": "the date and time when the dataset was ingested",
        "last_modified_date": "the date and time when the dataset was last modified",
        "gcsurl": "the storage uri string where bulks are saved",
        "ctag": "a coherency hash tag that changes every time this record is modified",
        "readonly": "the access mode level",
        "filemetadata": {
            "nobjects": "the number of blobs composing the dataset",
            "size": "the dataset bulk total size",
            "type": "the type of the manifest",
            "checksum": "the dataset bulk checksum",
            "tier_class": "the dataset storage tier class"
        },
        "computed_size": "the computed dataset size",
        "computed_size_date": "the date and time when the dataset size was computed",
        "seismicmeta_guid": "the associated OSDU record id"
    }
}

The SDMS V4 record (simplified)

{
    "kind": "the osdu dataset kind",
    "acl": {
        "viewers": "list of entitlement groups with viewer rights",
        "owners": "list of entitlement groups with admin rights",
    },
    "legal": {
        "legaltags": "the list of legal tags",
        "otherRelevantDataCountries": "the list of data countries",
        "status": "the legal status"
    },
    "data": {
      "Name": "the dataset name",
      "Description": "the dataset description",
      "TotalSize": "the dataset total size",
      "DatasetProperties": {
        "FileCollectionPath": "the dataset virtual folder path",
        "FileSourceInfos": [
            {
                "FileSource": "the file component source",
                "PreloadFilePath": "the file component origin",
                "Name": "the file component name",
                "FileSize": "the file component size",
                "Checksum": "the file component checksum",
                "ChecksumAlgorithm": "the checksum algorithm"
            }
        ],
        "Checksum": "the dataset checksum"
      }
    }
}

ADR symbols definitions

To make it simpler for the reader to understand the examples in the following sections, we define the following symbols:

Symbol                  Description
RV3                     the SDMS V3 record
RV4                     the SDMS V4 record
RV4.DatasetProperties   the RV4.data.DatasetProperties element
RV4.FileSourceInfos     the RV4.data.DatasetProperties.FileSourceInfos element

The SDMS V3 record generation in detail

  • RV3.id

    The ID in SDMS V3 is autogenerated based on the values composing the SDMS V3 URI: tenant, subproject, path and name.

    import hashlib

    # the generated id is used as the key in the SDMS V3 journal catalogue
    hash_obj = hashlib.sha512()
    hash_obj.update((RV3.data.path + RV3.data.name).encode('utf-8'))
    hashed_value = hash_obj.hexdigest()
    RV3.id = 'ds-' + RV3.data.tenant + '-' + RV3.data.subproject + '-' + hashed_value
  • RV3.data.name

    The dataset name.

    if 'Name' in RV4.data:
        RV3.data.name = RV4.data.Name
    elif len(RV4.FileSourceInfos) == 1 and 'Name' in RV4.FileSourceInfos[0]:
        RV3.data.name = RV4.FileSourceInfos[0].Name
    else:
        RV3.data.name = RV4.id
  • RV3.data.tenant

    The dataset tenant name matches the data-partition-id in the OSDU model. This information cannot be derived from the V4 record itself, but it can easily be determined by the syncing process.

    RV3.data.tenant = data_partition_id
  • RV3.data.subproject

    The dataset resource group name (referred to as subproject in SDMS V3) must exist in SDMS V3 with the access_policy property set to dataset. Essentially, each partition in SDMS V3 should have a default data group where all SDMS V4 datasets can be collected. This required data group can be automatically created by the syncing process. The name of the data group will default to syncv4.

    RV3.data.subproject = "syncv4"
  • RV3.data.path

    The dataset virtual path represents the logical folder structure in the data group (subproject) where the dataset is stored.

    RV3.data.path = RV4.DatasetProperties.FileCollectionPath
  • RV3.data.acls

    The Access Control List (ACL) defines the entitlement groups with admin and viewer rights. The only difference is that in the SDMS V3 record the owners list is named admins, while the viewers list keeps the same name.

    RV3.data.acls.admins = RV4.acl.owners
    RV3.data.acls.viewers = RV4.acl.viewers
  • RV3.data.ltag

    In SDMS V3, legal tag information is represented by a unique value, whereas in SDMS V4, it is represented as a list. To simplify the record composition, we select the first valid legal tag from the V4 record list. If no valid legal tags are found in the V4 record, we should always set an invalid legal tag in V3. If this is not set, V3 will inherit a valid legal tag from the data group, risking the possibility of a non-accessible record in V4 being addressable in V3.

    RV3.data.ltag = None
    for tag in RV4.legal.legaltags:
        if isValid(tag):
            RV3.data.ltag = tag
            break
    if RV3.data.ltag is None:
        # no valid tag was found: keep the (invalid) first tag so V3 does not inherit a valid one
        RV3.data.ltag = RV4.legal.legaltags[0]
  • RV3.data.created-by

    The user who created/ingested the dataset.

    RV3.data['created-by'] = RV4.createUser
  • RV3.data.created_date

    The timestamp when the dataset was created/ingested.

    RV3.data.created_date = RV4.createTime
  • RV3.data.last_modified_date

    The timestamp when the dataset was last modified.

    RV3.data.last_modified_date = RV4.modifyTime
  • RV3.data.gcsurl

    The storage ID of the container/bucket where dataset bulk files are stored. This value is automatically generated based on the record ID value.

    import hashlib

    hash_obj = hashlib.sha256()
    hash_obj.update(RV4.id.encode('utf-8'))
    # the sha256 hex digest is 64 characters long; the last one is dropped to obtain a 63-character id
    RV3.data.gcsurl = hash_obj.hexdigest()[:-1]
  • RV3.data.ctag

    The Coherency Tag (ctag) is a hash code associated with the dataset descriptor that changes every time the metadata is updated. This property exists only in SDMS V3, and it is autogenerated.

    import secrets
    import string

    alphabet = string.ascii_letters + string.digits
    RV3.data.ctag = ''.join(secrets.choice(alphabet) for _ in range(16))
  • RV3.data.readonly

    The readonly property defines the dataset's status regarding readability. If set to false, the dataset can be accessed in both read and write modes. If set to true, the dataset can only be accessed in read mode. In SDMS V4, a dataset cannot be marked as readonly, and for this reason, in the generated V3 record, the value will be defaulted to false.

    RV3.data.readonly = False
  • RV3.data.filemetadata

    The filemetadata, also known as the dataset manifest, is an object containing information about how the dataset's bulks are stored in the cloud storage resource. The only supported manifest in SDMS V3 is the GENERIC, which requires that all objects composing the dataset be saved in sequential order using the 0 to N-1 naming convention, where N is the number of objects. The fields composing the dataset manifest are:

    nobjects: the number of objects composing the dataset. This value can be computed by counting the blobs stored under the dataset storage location.

    size: the dataset total size, computed by summing the sizes of all objects composing the dataset. Alternatively, RV4.data.TotalSize can be used if present, but computing the size directly provides a more reliable result.

    type: the manifest type; GENERIC is the only supported value.

    checksum: the dataset checksum.

    tier_class: the dataset storage tiering class.

    blob_list = getBlobClient(connectionString)
    size = 0
    tier_class = None
    objects_num = 0
    error = False
    for blob in blob_list:
        # blobs must be named sequentially from 0 to N-1 for a GENERIC manifest
        if blob.name != str(objects_num):
            error = True
        if tier_class is None:
            tier_class = blob.blob_tier
        objects_num = objects_num + 1
        size = size + blob.size

    if not error:
        RV3.data.filemetadata.type = 'GENERIC'
        RV3.data.filemetadata.nobjects = objects_num
        RV3.data.filemetadata.size = size
        if 'Checksum' in RV4.DatasetProperties:
            RV3.data.filemetadata.checksum = RV4.DatasetProperties.Checksum
        RV3.data.filemetadata.tier_class = tier_class
    else:
        RV3.data.filemetadata = None
  • RV3.data.computed_size

    The computed_size is generated by SDMS V3 when the /size endpoint is triggered. This endpoint calculates the size of the datasets by summing the sizes of all composing objects. This field has been introduced because the dataset filemetadata object is an optional field created by client applications, such as sdapi or sdutil, and can only be trusted by them.

    blob_list = getBlobClient(connectionString)
    size = 0
    for blob in blob_list:
        size = size + blob.size
    RV3.data.computed_size = size
  • RV3.data.computed_size_date

    This is the timestamp of when the dataset size has been computed by SDMS V3.

    import datetime

    RV3.data.computed_size_date = str(datetime.datetime.now())
  • RV3.data.seismicmeta_guid

    The seismicmeta_guid is the ID of an OSDU record linked with the SDMS V3 dataset. For synced datasets it is set to the SDMS V4 record ID, so that consumer applications can retrieve all the extra properties.

    RV3.data.seismicmeta_guid = RV4.id
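
Putting the individual mappings above together, the conversion can be expressed as a single routine. The following Python sketch is illustrative only: it assumes the OSDU record is available as a plain dict, that the data-partition-id is supplied by the syncing process, and that the legal-tag validation is injected as a callable standing in for the real legal service check; it is not the actual service implementation.

import hashlib
import secrets
import string


def generate_v3_record(rv4: dict, data_partition_id: str, is_valid_ltag=lambda tag: True) -> dict:
    """Illustrative sketch: build an SDMS V3 descriptor from an OSDU (V4) record dict."""
    props = rv4["data"].get("DatasetProperties", {})
    sources = props.get("FileSourceInfos", [])

    # Name: prefer data.Name, then a single file component name, then the record id.
    if "Name" in rv4["data"]:
        name = rv4["data"]["Name"]
    elif len(sources) == 1 and "Name" in sources[0]:
        name = sources[0]["Name"]
    else:
        name = rv4["id"]

    tenant = data_partition_id
    subproject = "syncv4"                       # default data group for synced datasets
    path = props.get("FileCollectionPath", "/")

    # Record id: ds-<tenant>-<subproject>-<sha512(path + name)>
    hashed_value = hashlib.sha512((path + name).encode("utf-8")).hexdigest()

    # Legal tag: first valid tag, otherwise keep the first (invalid) one.
    ltag = next((tag for tag in rv4["legal"]["legaltags"] if is_valid_ltag(tag)),
                rv4["legal"]["legaltags"][0])

    return {
        "id": "ds-" + tenant + "-" + subproject + "-" + hashed_value,
        "data": {
            "name": name,
            "tenant": tenant,
            "subproject": subproject,
            "path": path,
            "acls": {"admins": rv4["acl"]["owners"], "viewers": rv4["acl"]["viewers"]},
            "ltag": ltag,
            "created-by": rv4["createUser"],
            "created_date": rv4["createTime"],
            "last_modified_date": rv4["modifyTime"],
            "gcsurl": hashlib.sha256(rv4["id"].encode("utf-8")).hexdigest()[:-1],
            "ctag": "".join(secrets.choice(string.ascii_letters + string.digits) for _ in range(16)),
            "readonly": False,
            "seismicmeta_guid": rv4["id"],
            # filemetadata and computed_size are filled in once the bulk objects exist
        },
    }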

The Script to validate the proposed conversion

  • The script sync-script.py is provided with this ADR (for testing purposes only) to demonstrate and validate the synching flow between SDMS V4 and V3 (a simplified skeleton of the flow is sketched after this list):

    • Create a random data file of 16MB and compute the checksum
    • Fill an OSDU record and register it in SDMS V4
    • Upload the 16MB file as 4 objects of 4MB each using the connection string generated via SDMS V4
    • Generate a V3 metadata record and register it in SDMS V3
    • Ensure the dataset in SDMS V3 can be located after ingestion
    • Download all objects using the connection string generated via SDMS V3
    • Compare the initial data with the downloaded data to ensure they match
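
A simplified skeleton of this flow is shown below. Every helper it calls (register_v4_record, get_v4_upload_connection_string, upload_blob, register_v3_record, locate_v3_dataset, download_v3_objects) is hypothetical and merely stands in for the corresponding SDMS V4/V3 API call performed by the real sync-script.py; generate_v3_record refers to the conversion sketch above.

import hashlib
import os

CHUNK = 4 * 1024 * 1024  # 4 MB per object


def run_validation(partition: str = "opendes") -> None:
    # 1. Create a random 16 MB payload and compute its checksum.
    payload = os.urandom(4 * CHUNK)
    original_checksum = hashlib.md5(payload).hexdigest()

    # 2. Fill an OSDU record and register it in SDMS V4 (hypothetical helper).
    rv4 = register_v4_record(name="data-sync.segy", size=len(payload), checksum=original_checksum)

    # 3. Upload the payload as 4 objects of 4 MB each, named "0".."3",
    #    using the connection string obtained from SDMS V4 (hypothetical helpers).
    connection = get_v4_upload_connection_string(rv4["id"])
    for i in range(4):
        upload_blob(connection, name=str(i), data=payload[i * CHUNK:(i + 1) * CHUNK])

    # 4. Generate the V3 metadata record and register it in SDMS V3 (hypothetical helper).
    rv3 = generate_v3_record(rv4, data_partition_id=partition)
    register_v3_record(rv3)

    # 5. Locate the dataset via SDMS V3, download all objects and compare checksums.
    assert locate_v3_dataset(rv3["id"]) is not None
    downloaded = b"".join(download_v3_objects(rv3["id"]))
    assert hashlib.md5(downloaded).hexdigest() == original_checksum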

Example of an SDMS V4 ingested record

{
    "id": "opendes:dataset--FileCollection.SEGY:7fe06451787641c4953a06a63e44967a",
    "kind": "osdu:wks:dataset--FileCollection.SEGY: 1.1.0",
    "version": 1694519237996696,
    "acl": {
        "viewers": [
            "data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.viewer@opendes.domain.com"
        ],
        "owners": [
            "data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.admin@opendes.domain.com"
        ]
    },
    "legal": {
        "legaltags": [
            "ltag-seistore-test-01"
        ],
        "otherRelevantDataCountries": [
            "US"
        ],
        "status": "compliant"
    },
    "modifyUser": "test-user@domain.com",
    "modifyTime": "2023-09-07T11:47:18.625Z",
    "createUser": "test-user@domain.com",
    "createTime": "2023-09-07T07:17:58.443Z",
    "data": {
        "Name": "data-sync.segy",
        "TotalSize": "16777216",
        "Description": "SDMS synching test record",
        "DatasetProperties": {
            "FileCollectionPath": "/f1/f2/f3/",
            "FileSourceInfos": [
                {
                    "FileSource": "data-sync.segy",
                    "Name": "data-sync.segy",
                    "FileSize": "16777216",
                    "Checksum": "8ce2025f9b27e3017ab15f15b261d599",
                    "ChecksumAlgorithm": "MD5"
                }
            ],
            "Checksum": "8ce2025f9b27e3017ab15f15b261d599"
        }
    }
}

Example of a generated SDMS V3 metadata

{
    "id": "ds-opendes-syncv4-c0699ac77bc64a5772ac7f6f455ce5a251e3686d87d26e91df2ecc73e7bfdf4b0a16ac757c2ec227c1a6814d097a0b6b759a01dc52753754a0a18dfaea53c7d0",
    "data": {
        "name": "data-sync.segy",
        "tenant": "opendes",
        "subproject": "syncv4",
        "path": "/f1/f2/f3/",
        "acls": {
            "admins": [
                "data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.admin@opendes.domain.com"
            ],
            "viewers": [
                "data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.viewer@opendes.domain.com"
            ]
        },
        "ltag": "ltag-seistore-test-01",
        "created-by": "test-user@domain.com",
        "created_date": "2023-09-07T07:17:58.443Z",
        "last_modified_date": "2023-09-07T11:47:18.625Z",
        "gcsurl": "a5993feef91df715c176452fe1a26d04ca70e88d0ccff268e92cd74c76dde61",
        "ctag": "9STTAfiKl4iukKbp",
        "readonly": "false",
        "filemetadata": {
            "nobjects": 4,
            "size": 16777216,
            "type": "GENERIC",
            "checksum": "8ce2025f9b27e3017ab15f15b261d599",
            "tier_class": "Hot"
        },
        "computed_size": 16777216,
        "computed_size_date": "2023-09-12 13:47:45.877142",
        "seismicmeta_guid": "opendes:dataset--FileCollection.SEGY:7fe06451787641c4953a06a63e44967a"
    }
}

SDMS V4 to V3 Synching Automation

The preceding section explains the process of creating a metadata descriptor for SDMS V3 using an OSDU record. This metadata descriptor enables access to a dataset ingested in SDMS V4 through SDMS V3.

In order to automate the process, we will deploy a new service called the sdms-sync-service, which will be responsible for generating an SDMS V3 record every time a new dataset is registered in SDMS V4. When a dataset is registered in SDMS V4, a message of the form insert-synch-v4:{record-id}:{partition}:{other-required-params} will be pushed into a Redis queue. The new service will consume the messages from the Redis queue and initiate the synching process:

  • retrieve the OSDU record from the storage service
  • generate the corresponding SDMS V3 metadata descriptor
  • save the generated metadata in the SDMS V3 journal

[Figure: sdms-sync-service]
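
A minimal consumer loop for the insert case could look like the sketch below. It assumes messages are pushed as entries on a Redis list and consumed with the standard redis-py client; the queue name, the parse_sync_message helper and the fetch/generate/save functions are placeholders for the real sync-service components (OSDU record ids themselves contain ':', so parsing the message is intentionally left abstract).

import redis  # standard redis-py client


# Hypothetical placeholders standing in for the real sync-service dependencies.
def parse_sync_message(message: str) -> tuple:
    """Split 'insert-synch-v4:{record-id}:{partition}:{...}' into (action, record_id, partition)."""
    raise NotImplementedError


def fetch_osdu_record(record_id: str, partition: str) -> dict: ...
def generate_v3_record(rv4: dict, data_partition_id: str) -> dict: ...
def save_v3_journal_entry(rv3: dict) -> None: ...


def run_sync_consumer(queue_name: str = "sdms-sync-queue") -> None:
    client = redis.Redis(host="localhost", port=6379)
    while True:
        # Blocking pop of the next sync message pushed by SDMS V4.
        _, raw = client.blpop(queue_name)
        action, record_id, partition = parse_sync_message(raw.decode("utf-8"))
        if action == "insert-synch-v4":
            rv4 = fetch_osdu_record(record_id, partition)   # 1. retrieve the OSDU record
            rv3 = generate_v3_record(rv4, partition)        # 2. build the V3 descriptor
            save_v3_journal_entry(rv3)                      # 3. write it to the V3 journal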

Details

  • If a dataset is patched in SDMS V4, the service should push an insert message into the Redis queue:

    • If the previous insert message is still in the queue (not yet consumed by the sync service), the existing entry will be overwritten in the queue, and the sync service will create the updated one.

    • If the previous version was already synced, when the new message is consumed, the updated record will be created, and because the generated key is identical, it will overwrite the existing record in the journal.

  • If a dataset is deleted in SDMS V4, the service should push a delete message into the Redis queue:

    • When the delete message is consumed, the sync service will generate only the V3 record key and remove the entry from the journal.

    • If the insert message has not yet been consumed from the queue, the sync service should check, when it consumes it, whether a delete message is also present for the same record. If one is found, the sync service will skip the sync process and remove both the insert and delete entries from the Redis queue, as sketched below.
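
One possible shape for this check is sketched below, under the assumption that every queued message is also mirrored as a per-record Redis marker key so the consumer can cheaply look for a pending delete; the key naming convention and helper are illustrative, not part of the existing services.

import redis  # standard redis-py client


def should_process_insert(client: redis.Redis, record_id: str, partition: str) -> bool:
    """Return True if the insert should be synced, False if a pending delete cancels it."""
    # Assumed convention: each queued message is mirrored as 'pending:<action>:<partition>:<record-id>'.
    delete_marker = "pending:delete:" + partition + ":" + record_id
    insert_marker = "pending:insert:" + partition + ":" + record_id

    if client.exists(delete_marker):
        # A delete for the same record is already queued: skip the sync and
        # drop both markers so neither message is acted upon later.
        client.delete(delete_marker, insert_marker)
        return False

    client.delete(insert_marker)
    return True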

Limitations

When a dataset is registered in V4 via a client app, the record is created instantaneously, while uploading the bulk data into the storage resource takes longer. If the insert message is consumed before the bulk data is uploaded, the file manifest cannot be computed due to missing objects. To address this issue, we can enable a background process in the sync-service that loops over the created SDMS V3 records and updates the manifest in cases where it does not exist or when the last modified time in the corresponding SDMS V4 record is greater than the one reported in the V3 entry. This approach should be re-discussed with the community to find an optimal strategy to apply.
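
A possible sketch of that background process is shown below; list_synced_v3_records, fetch_osdu_record, compute_manifest and update_v3_record are hypothetical helpers, the loop interval is arbitrary, and the staleness check assumes both timestamps are ISO-8601 strings in the same timezone.

import time


# Hypothetical helpers standing in for journal and storage access.
def list_synced_v3_records(subproject: str = "syncv4") -> list: ...
def fetch_osdu_record(record_id: str, partition: str) -> dict: ...
def compute_manifest(rv3: dict) -> dict: ...
def update_v3_record(rv3: dict) -> None: ...


def reconcile_manifests(interval_seconds: int = 300) -> None:
    """Periodically backfill manifests that could not be computed at sync time."""
    while True:
        for rv3 in list_synced_v3_records():
            rv4 = fetch_osdu_record(rv3["data"]["seismicmeta_guid"], rv3["data"]["tenant"])
            stale = rv4["modifyTime"] > rv3["data"]["last_modified_date"]
            if rv3["data"].get("filemetadata") is None or stale:
                rv3["data"]["filemetadata"] = compute_manifest(rv3)
                rv3["data"]["last_modified_date"] = rv4["modifyTime"]
                update_v3_record(rv3)
        time.sleep(interval_seconds)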
