[ADR] Synching SDMS V3 datasets in SDMS V4
Introduction
We need a solution that makes datasets ingested in SDMS V3 visible to, and consumable by, SDMS V4.
This ADR describes a synchronization mechanism that allows users of SDMS V4 to consume seismic dataset entities ingested in SDMS V3 via client applications, even though the two versions of the system have entirely different architectures.
Status
- Initiated
- Proposed
- Under Review
- Approved
- Rejected
Problem statement
The Seismic Data Management Service V4 (SDMS V4) stores and manages data types as defined by the Open Subsurface Data Universe (OSDU) Authority. The APIs (Application Programming Interfaces) provide robust data type checks and are fully integrated with the OSDU policy service. The goal is to minimize ambiguity in the authorization model and facilitate straightforward adoption through a consistent usage pattern. In contrast, the V3 version of the service defines, saves, and manages proprietary metadata records, interacts directly with the entitlement service, and organizes records into collections/data-groups named subprojects.
The key difference between the two versions of the service lies in how the cloud storage URI is generated: SDMS V4 derives it from the record-id value, while SDMS V3 generates a random UUID.
Proposed solution
Update SDMS V4 by adding the capability to correctly retrieve the storage location for the dataset's bulk data if the dataset was ingested via SDMS V3.
Scenarios
When a dataset is ingested in SDMS V3 from a seismic application, the latter also creates an OSDU Bulk record linked to a Work Product Component, as shown in the following diagram:
The seismic application saves the SDMS V3 URI (also known as sdpath) in the FileSourceInfo property of the created OSDU Bulk record. This allows the URI to be passed to SDMS V3 later, in order to retrieve the storage connection string required to access the dataset's bulk data.
Example of SDMS V3 dataset metadata
{
"name": "test-data.zgy",
"tenant": "partition",
"subproject": "subproject",
"path": "/",
"ltag": "test-legal",
"created_by": "test-user@slb.com",
"last_modified_date": "Tue Sep 12 2023 11:04:29 GMT+0000 (Coordinated Universal Time)",
"created_date": "Tue Sep 12 10:54:10 GMT+0000 (Coordinated Universal Time)",
"gcsurl": "ss-weu-xkz32bjwg2425gn/bdf36c8a-3c62-3151-12b7-227af4727520",
"ctag": "sMTz0oWeId1nOnrx",
"readonly": true,
"sbit": null,
"sbit_count": 0,
"filemetadata": {
"type": "GENERIC",
"size": 1544552448,
"nobjects": 47
},
"seismicmeta_guid": "partition:work-product-component--SeismicTraceData:326bac9a-1fb2-5c73-9c64-6ca122c5025a",
"access_policy": "uniform"
}
Example of OSDU storage associated Work Product Component
{
"id": "partition:work-product-component--SeismicTraceData:326bac9a-1fb2-5c73-9c64-6ca122c5025",
"kind": "osdu:wks:work-product-component--SeismicTraceData:1.3.0",
"version": 1685099234631439,
"acl": {
"viewers": [
"data.test@domain.slb.com"
],
"owners": [
"data.test@domain.com"
]
},
"legal": {
"legaltags": [
"test-legal"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"data": {
"BinGridID": "partition:work-product-component--SeismicBinGrid:2a714f2b12aa346d16a08c5a2f4e157e:",
"Datasets": [
"partition:dataset--FileCollection.Slb.OpenZGY:1de532c2-4d1b-5316-ba4a-422342321d55"
],
"DDMSDatasets": [
"urn:dataset--FileCollection.Slb.OpenZGY:1de532c2-4d1b-5316-ba4a-422342321d55"
],
"Name": "test-data.zgy",
"Source": "osdu",
"SubmitterName": "test-user@domain.com"
},
"createUser": "test-user@domain.com",
"createTime": "2023-09-12T11:04:30.321Z",
"modifyUser": "test-user@domain.com",
"modifyTime": "2023-09-12T18:09:12.703Z"
}
Example of OSDU storage associated File Collection
{
"id": "partition:dataset--FileCollection.Slb.OpenZGY:1de532c2-4d1b-5316-ba4a-422342321d55",
"version": "4426199321664216",
"kind": "osdu:wks:dataset--FileCollection.Slb.OpenZGY:1.0.0",
"acl": {
"viewers": [
"data.test@domain.slb.com"
],
"owners": [
"data.test@domain.com"
]
},
"legal": {
"legaltags": [
"test-legal"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"createUser": "test-user@domain.com",
"createTime": "2023-09-12T11:04:02.705Z",
"data": {
"Endian": "BIG",
"SEGYRevision": "rev 1",
"TotalSize": "1544552448",
"Name": "test-data.zgy",
"DatasetProperties": {
"FileCollectionPath": "sd://tenant/subproject/",
"FileSourceInfos": [
{
"FileSource": "test-data.zgy",
"Name": "test-data.zgy",
"FileSize": "1544552448"
}
]
}
}
}
Proposed Solution
To enable applications to access bulk datasets ingested in SDMS V3 through SDMS V4, we need to update the mechanism in SDMS V4 for retrieving the correct storage URI associated with the Bulk record. This update is necessary to generate a valid connection string for accessing the bulk data.
When a Bulk record is created, the SDMS V3 URI (also known as sdpath) is typically saved in the FileCollectionPath and FileSource properties. In the most common scenarios, the sd://tenant/subproject/path portion of the URI is stored in the FileCollectionPath property, while the dataset name is stored in the FileSource property.
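As an illustration of the common case, the following minimal sketch splits a V3 URI into the two property values and joins them back. The helper names `split_sdpath` and `join_sdpath` are hypothetical and are not part of SDMS; the sample values come from the File Collection example above.

```python
from __future__ import annotations


def split_sdpath(sdpath: str) -> tuple[str, str]:
    """Split a V3 URI into (FileCollectionPath, FileSource) values."""
    # Everything up to the last "/" is the collection path, the rest is the name.
    collection_path, _, name = sdpath.rpartition("/")
    return collection_path + "/", name


def join_sdpath(file_collection_path: str, file_source: str) -> str:
    """Rebuild the V3 URI from the two Bulk record properties."""
    # Tolerate a FileCollectionPath stored with or without a trailing slash.
    return file_collection_path.rstrip("/") + "/" + file_source
```

For instance, `join_sdpath("sd://tenant/subproject/", "test-data.zgy")` yields `sd://tenant/subproject/test-data.zgy`.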
When a connection access string is requested for a Bulk record through SDMS V4, the service should detect whether the record's file source refers to a V3 dataset URI. If it does, the service should then:
- extract the subproject name from the FileCollectionPath
  subproject = record.data.DatasetProperties.FileCollectionPath.replace("sd://", "").split('/')[1]
- extract the path from the FileCollectionPath
  path = ("/" + "/".join(record.data.DatasetProperties.FileCollectionPath.replace("sd://", "").split('/')[2:]) + "/").replace("//", "/")
- extract the name from the FileSource
  name = record.data.DatasetProperties.FileSourceInfos[0].FileSource
- retrieve the storage URL from the V3 journal
  SELECT c.data.gcsurl FROM c WHERE c.data.subproject="{subproject}" AND c.data.path="{path}" AND c.data.name="{name}"
- generate the connection string using the retrieved storage URL
  storage_client = StorageClient("{storage-url}")
  return storage_client.getConnectionString()
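The extraction and lookup steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: `V3DatasetKey`, `extract_v3_key`, `build_journal_query` and `find_gcsurl` are hypothetical names, the journal is replaced by an in-memory list so the example runs on its own, V3 paths are assumed to carry a leading and a trailing slash (as in the `"path": "/"` metadata example), and the parameterized query shape assumes a Cosmos DB style SQL dialect, which the `FROM c` query suggests but the ADR does not state. Generating the connection string is left out because it depends on the actual storage client.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class V3DatasetKey:
    """Fields that identify a dataset in the V3 journal (hypothetical type)."""
    subproject: str
    path: str
    name: str


def extract_v3_key(record: dict) -> V3DatasetKey:
    """Derive the V3 lookup key from an OSDU File Collection record."""
    props = record["data"]["DatasetProperties"]
    # "sd://tenant/subproject/a/b/" -> ["tenant", "subproject", "a", "b", ""]
    parts = props["FileCollectionPath"].replace("sd://", "").split("/")
    subproject = parts[1]
    # Assumption: V3 paths look like "/a/b/", with "/" for the subproject root.
    folders = [p for p in parts[2:] if p]
    path = "/" + "/".join(folders) + "/" if folders else "/"
    name = props["FileSourceInfos"][0]["FileSource"]
    return V3DatasetKey(subproject, path, name)


def build_journal_query(key: V3DatasetKey) -> dict:
    """Parameterized form of the journal query (avoids string interpolation)."""
    return {
        "query": (
            "SELECT c.data.gcsurl FROM c "
            "WHERE c.data.subproject = @subproject "
            "AND c.data.path = @path AND c.data.name = @name"
        ),
        "parameters": [
            {"name": "@subproject", "value": key.subproject},
            {"name": "@path", "value": key.path},
            {"name": "@name", "value": key.name},
        ],
    }


def find_gcsurl(journal: list[dict], key: V3DatasetKey) -> str | None:
    """In-memory stand-in for the journal lookup; returns None if not found."""
    for entry in journal:
        data = entry["data"]
        if (data["subproject"], data["path"], data["name"]) == (
            key.subproject, key.path, key.name
        ):
            return data["gcsurl"]
    return None
```

Using query parameters instead of formatting values into the SQL text keeps dataset names with quotes or other special characters from breaking the query.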
Notes
Seismic applications use different approaches to save the SDMS V3 URI in the Bulk record, and all these cases should be considered:
- The sd://tenant/subproject/path URI is saved in FileCollectionPath, and the name is saved in FileSource.
- The full sd://tenant/subproject/path/name URI is saved in both FileCollectionPath and FileSource.
- The sd://tenant/subproject/path URI is saved in FileCollectionPath, and the name in FileSource, but the latter starts with the ./ prefix (which should be removed).
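One possible way to reduce the three variants above to a single (collection path, name) pair is sketched below. The function name `normalize_v3_location` is hypothetical, and detecting the second variant by checking whether FileSource starts with `sd://` is an assumption rather than something the ADR specifies.

```python
from __future__ import annotations


def normalize_v3_location(file_collection_path: str, file_source: str) -> tuple[str, str]:
    """Return (collection_path, name) for the three variants listed above."""
    # Variant 3: the name is stored with a leading "./", which is dropped.
    if file_source.startswith("./"):
        file_source = file_source[2:]
    # Variant 2 (assumed detection): FileSource holds the full URI, name included.
    if file_source.startswith("sd://"):
        collection_path, _, name = file_source.rpartition("/")
        return collection_path + "/", name
    # Variant 1: folder URI in FileCollectionPath, bare name in FileSource.
    return file_collection_path.rstrip("/") + "/", file_source
```

All three variants then resolve to the same pair, for example `("sd://tenant/subproject/path/", "data.zgy")`, which can feed the subproject, path and name extraction described earlier.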
Limitations
Applications that do not match the described flow should be reviewed with the application owner before defining the right strategy to enable the synchronization of datasets ingested in SDMS V3 with SDMS V4.