seismic-dms-service issues
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues
2024-03-19T14:59:18Z

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/130 · IBM E2E tests fail · 2024-03-19 · Daniel Perez

E2E tests for IBM in SDMS V3 are failing with "no healthy upstream"; this seems to be an issue with the environment itself.

Assignees: Anuj Gupta, Isha Kumari

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/129 · DATASET SELECT LS POST: while putting invalid characters in select it is giving response code 200. it should give 400 · 2024-02-29 · Isha Kumari

When invalid characters are supplied in `select`, the DATASET SELECT LS POST endpoint returns response code 200; it should return 400.

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/128 · Subproject creation accepts non-existing groups in ACLs · 2024-02-26 · Yan Sushchynski (EPAM)

## Description of the problem

There is an issue where it is possible to create a new subproject with non-existing groups in the `acls` field. Afterwards, any action in the subproject except deleting it returns `403`.

## Steps to reproduce it
1. Create a new subproject with invalid acls:
```shell
curl --location --request POST 'https://<svc_url>/v3/subproject/tenant/osdu/subproject/test-123' \
--header 'x-api-key: {{SVC_API_KEY}}' \
--header 'Content-Type: application/json' \
--header 'ltag: osdu-demo-legaltag' \
--header 'appkey: {{DE_APP_KEY}}' \
--header 'Authorization: Bearer <token>' \
--data-raw '{
"storage_class": "REGIONAL",
"storage_location": "US-CENTRAL1",
"acls": {
"admins": [
"data.sdms.non-existing.admin@osdu.group"
],
"viewers": [
"data.sdms.non-existing.viewer@osdu.group"
]
}
}'
```
This request is executed without any error.
2. Try to upload any file to the subproject:
```shell
python sdutil cp somefile sd://osdu/test-123/somefile
```
Output:
```
[403] [seismic-store-service] User not authorized to perform this operation
```

Assignees: Diego Molteni, Sacha Brants

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/127 · Issue with Get Status API · 2024-02-09 · Jiman Kim

Hello, we are running some authentication testing and are running into some behavior that may or may not be a bug.
For this endpoint:

`/seistore-svc/api/v4/status`

We have 3 tests running:

1. Sends an invalid token.
2. Sends a valid token but signed with a wrong secret.
3. Sends the HTTP request without an authorization header.

Tests 1 and 2 return a 401, but test 3 returns 200.
Is this a bug or intended behavior?
Thank you!

Milestone: M21 - Release 0.24

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/125 · Patch dataset name issue · 2024-01-22 · Yan Sushchynski (EPAM)

We ran the [collection](https://community.opengroup.org/osdu/platform/pre-shipping/-/blob/main/R3-M22/GC-M22/GC_OSDU_Smoke_Tests.postman_collection.json?ref_type=heads), and this request
```bash
curl --location --request PATCH 'https://<host>/api/seismic-store/v3/dataset/tenant/m19/subproject/subprojectodi374308/dataset/AutoTest_dsetodi831125?path=autotest_path' \
--header 'Content-Type: application/json' \
--header 'data-partition-id: m19' \
--header 'Authorization: Bearer token' \
--data '{
"dataset_new_name": "autotest_new",
"metadata": {
"f1": "v1",
"f2": "v2",
"f3": "v3"
},
"filemetadata": {
"f1": "v1",
"f2": "v2",
"f3": "v3"
},
"last_modified_date": "Thu Jul 16 2020 04:37:41 GMT+0000 (Coordinated Universal Time)",
"gtags": [
"tag01",
"tag02",
"tag03"
],
"ltag": "m19-SeismicDMS-Legal-Tag-Test7649172",
"readonly": false,
"seismicmeta": {
"kind": "m19:seistore:seismic2d:1.0.0",
"legal": {
"legaltags": [
"m19-SeismicDMS-Legal-Tag-Test7649172"
],
"otherRelevantDataCountries": [
"US"
]
},
"data": {
"msg": "Auto Test sample data patched"
}
}
}'
```
And, we get the following error:
```bash
[seismic-store-service] The dataset sd://m19/subprojectodi374308/autotest_path/autotest_new already exists
```

even though there is no such dataset in Seismic at the moment.

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/123 · [IBM] replace keycloak-admin with @keycloak/keycloak-admin-client · 2024-03-05 · Diego Molteni

Please replace the deprecated and vulnerable package [keycloak-admin](https://www.npmjs.com/package/keycloak-admin) with the new [@keycloak/keycloak-admin-client](https://www.npmjs.com/package/@keycloak/keycloak-admin-client).

Milestone: M23 - Release 0.26
Assignee: Isha Kumari

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/122 · V4 API and Postman Collection showcasing the steps/sequence · 2024-01-11 · Debasis Chatterjee

We started to look at the collection provided by AWS:
https://community.opengroup.org/osdu/platform/pre-shipping/-/blob/main/R3-M22/AWS-M22/DDMS%20Seismic/AWS_OSDUR3M22_Seismic_v4_Automated.postman_collection.json
This was apparently created from an initial example provided by the Dev team (Seismic DDMS).
We are a little unclear about the logical sequence and naming of the folder/requests.
Folder "Schema" is really to create some catalog record (Dataset FileCollection.SegY).
Folder "Connection" is apparently to upload some data files. Should this not come before we can create the Dataset record?
Something similar to what we see here, as the sequence of steps.
![image](/uploads/e1579cc87851b5e8995c0892dde824f7/image.png)
Is the need for sdutil completely eliminated? Earlier, we had to upload the data file (SegY) using sdutil to a suitable tenant and sub-project.
Perhaps a **companion document** with the **Postman Collection** would help.
@chad earlier mentioned that the DEV team would probably provide a video showing the steps?
Thank you
cc @spoddar, @kimjiman and @ydzeng

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/121 · [SAST] Client_Privacy_Violation in file queue.ts · 2023-11-13 · Yauhen Shaliou [EPAM/GCP]

**Description**
Method `setup` at line 42 of `seismic-store-service/app/sdms/src/cloud/shared/queue.ts` sends user information outside the application. This may constitute a Privacy Violation.
<table>
<tr>
<th> </th>
<th>Source</th>
<th>Destination</th>
</tr>
<tr>
<th>File</th>
<td>seismic-store-service/app/sdms/src/cloud/shared/queue.ts</td>
<td>seismic-store-service/app/sdms/src/cloud/providers/azure/insights.ts</td>
</tr>
<tr>
<th>Line number</th>
<td>42</td>
<td>129</td>
</tr>
<tr>
<th>Object</th>
<td>password</td>
<td>log</td>
</tr>
<tr>
<th>Code line</th>
<td>redisOptions.password = cacheParams.KEY;</td>
<td>console.log(data);</td>
</tr>
</table>

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/120 · [SAST] SSL_Verification_Bypass in file cosmosdb.ts · 2023-11-13 · Yauhen Shaliou [EPAM/GCP]

# **Location:**
<table>
<tr>
<th> </th>
<th>Destination</th>
</tr>
<tr>
<th>File</th>
<td>seismic-store-service/app/sdms/src/cloud/providers/azure/cosmosdb.ts</td>
</tr>
<tr>
<th>Line number</th>
<td>67</td>
</tr>
<tr>
<th>Object</th>
<td>rejectUnauthorized</td>
</tr>
<tr>
<th>Code line</th>
<td>rejectUnauthorized: false</td>
</tr>
</table>
**Description**
`seismic-store-service/app/sdms/src/cloud/providers/azure/cosmosdb.ts` relies on HTTPS requests in its constructor. The `rejectUnauthorized` parameter, at line 67, effectively disables verification of the SSL certificate trust chain.

Example of JavaScript explicitly disabling certificate verification:

```javascript
var https = require('https');
var options = {
    hostname: 'domain.com',
    port: 443,
    path: '/',
    method: 'GET',
    rejectUnauthorized: false
};
options.agent = new https.Agent(options);
var req = https.request(options, function (res) {
    res.on('data', function (d) {
        handleRequest(d);
    });
});
req.end();
```

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/119 · Rename "IStorage" methods for v4 · 2023-10-24 · Yan Sushchynski (EPAM)

Hello,
I noticed that the cloud-storage [interface](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/blob/master/app/sdms-v4/src/cloud/storage.ts?ref_type=heads#L19) has the following methods:
```typescript
createBucket(bucketName: string): Promise<void>;
bucketExists(bucketName: string): Promise<boolean>;
deleteBucket(bucketName: string): Promise<void>;
```
These method names suggest that new buckets are getting created, checked for existence, or deleted within a single data-partition. However, the GC and Baremetal implementations are different -- a data-partition is expected to work with its own pre-created bucket instead of creating new ones. This discrepancy between the method names and their actual functionality could lead to confusion and misunderstanding.
A similar situation exists in the AWS implementation, where comments had to be added to clarify that 'bucketNames' are actually BLOBs, which can be seen [here](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/blob/master/app/sdms-v4/src/cloud/providers/aws/storage.ts?ref_type=heads#L45).
I propose that we consider renaming these methods to more accurately reflect their functionality and create a better alignment with the actual implementation.
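For illustration only, the renaming could look like the following sketch (a Python stand-in for the TypeScript interface; the new method names are hypothetical suggestions, not decided in this issue):

```python
from abc import ABC, abstractmethod


class Storage(ABC):
    """Hypothetical renamed storage interface.

    The names reflect that, in the GC and Baremetal implementations, each
    data-partition works against a single pre-created container rather
    than creating, checking, or deleting buckets on the fly.
    """

    @abstractmethod
    def ensure_container(self, container_name: str) -> None: ...

    @abstractmethod
    def container_exists(self, container_name: str) -> bool: ...

    @abstractmethod
    def remove_container_contents(self, container_name: str) -> None: ...
```

A concrete provider would then implement these against its cloud SDK, and the names would no longer suggest that new buckets are being created per call.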
Thank you.

Assignees: Diego Molteni, Yunhua Koglin, Sacha Brants, Mark Yan

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/117 · [ADR] Advanced filters for dataset search · 2023-12-04 · Alexandre Gattiker

# Introduction
We need additional filtering support to be able to filter the `POST /dataset/tenant/{tenantid}/subproject/{subprojectid}` and `PUT /operation/bulk-delete` (added in [!891](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/merge_requests/891/diffs#fafb01a8314993d61fca390beef912c7813278eb)) operations by metadata fields with more complex expressions than a single key-value match.
# Status
* [x] Initiated
* [x] Proposed
* [x] Under Review
* [ ] Approved
* [ ] Rejected
# Problem statement
The SDMS API `POST /dataset/tenant/{tenantid}/subproject/{subprojectid}` currently accepts the following body parameters, among others:
* `search`, a single SQL-like search parameter, for example: `search=name=file%`
* `gtags`, an array of strings matching tags associated with dataset metadata.
The `search` field does not support more than one field, or more than one possible value for a field.
The SDMS API `PUT /operation/bulk-delete` (added in [!891](https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/merge_requests/891/diffs#fafb01a8314993d61fca390beef912c7813278eb)) requires a `path` parameter containing `tenantid`, `subprojectid` and `path` but does not support filtering by metadata fields or tags.
For both search and delete, we need to be able to filter by more than one field, or more than one possible value for a field.
Furthermore, we expect a need for more complex filter solutions, such as combining `AND`, `OR` and `NOT` operators. The proposed solution should ideally be extensible to support additional expressions and operators in the future if needed.
# Proposed solution
Add an optional `filter` parameter to the `POST /dataset/tenant/{tenantid}/subproject/{subprojectid}` and `PUT /operation/bulk-delete` API endpoints.
The `search` and `gtags` parameters are to be deprecated.
## Overview
The `filter` parameter can take a payload with a variable format, allowing expressing a simple filter on a single field, as well as logical combinations of filters with arbitrary complexity.
The `POST /dataset/tenant/{tenantid}/subproject/{subprojectid}` operation has been selected for extension because:
* Advanced metadata filtering, encompassing select and search functionalities, has already been incorporated into that operation.
* The SDMS API also accepts the `GET` method for the operation with parameters provided in the query string, as a legacy endpoint. The `POST` version of the endpoint has been introduced to address issues related to handling large request parameters, where sending the cursor as a query parameter can lead to oversized requests and subsequent failures.
## Examples
Example value for the `filter` parameter:
```json
{
"and": [
{
"not": {
"property": "gtags",
"operator": "CONTAINS",
"value": "tagA"
}
},
{
"or": [
{
"property": "name",
"operator": "LIKE",
"value": "test.%"
},
{
"property": "name",
"operator": "=",
"value": "dataset.sgy"
}
]
}
]
}
```
This is equivalent to the following pseudo-SQL statement:
```sql
SELECT * FROM datasets d WHERE
NOT (EXISTS (SELECT VALUE 1 FROM t IN d.data.gtags WHERE t = 'tagA')
OR (IS_STRING(d.data.gtags) AND STRINGEQUALS(d.data.gtags, 'tagA')))
AND (
d.name LIKE 'test.%'
OR d.name = 'dataset.sgy'
)
```
## Details
The `filter` parameter can be:
* A **property match filter**:
```json
{
"property": "...",
"operator": "...",
"value": "..."
}
```
The implementation will be extensible with additional keys if needed in the future, e.g. to specify case sensitivity.
* An **`and` or `or` filter**, i.e. an object containing only the key `and` or `or`, of which the value is an array of one or more filters (i.e. a property match filter or an `and`, `or` or `not` filter)
```json
{
"and": [...]
}
```
* A **`not` filter**, i.e. an object containing only the key `not`, of which the value is a filter (i.e. a property match filter or an `and`, `or` or `not` filter)
```json
{
"not": ...
}
```
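For illustration, the filter grammar above can be evaluated with a small recursive function. This is a hedged sketch, not service code: the LIKE-to-regex translation and the CONTAINS semantics for a string-valued `gtags` are assumptions mirroring the pseudo-SQL example.

```python
import re


def matches(flt: dict, record: dict) -> bool:
    """Recursively evaluate a filter object against a dataset record."""
    if "and" in flt:
        return all(matches(f, record) for f in flt["and"])
    if "or" in flt:
        return any(matches(f, record) for f in flt["or"])
    if "not" in flt:
        return not matches(flt["not"], record)
    # property match filter
    value = record.get(flt["property"])
    op, target = flt["operator"], flt["value"]
    if op == "=":
        return value == target
    if op == "CONTAINS":
        # gtags may be an array of tags or a single string (see pseudo-SQL)
        return target in value if isinstance(value, list) else value == target
    if op == "LIKE":
        # translate SQL wildcards: % -> .*, _ -> .
        pattern = re.escape(target).replace("%", ".*").replace("_", ".")
        return value is not None and re.fullmatch(pattern, value) is not None
    raise ValueError(f"unsupported operator: {op}")
```

With the example filter from the Examples section, a dataset named `dataset.sgy` without `tagA` matches, while the same dataset tagged `tagA` does not.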
# Out of scope / limitations
The operations at `GET /utility/ls` and `POST /utility/ls` can also be used for retrieving datasets, but will not be extended with advanced filtering at the moment. That functionality can be added later if required.

Assignee: Diego Molteni

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/116 · delete user returns 400 on success instead of 200 · 2023-10-03 · Zachary Keirn

The delete-user-from-subproject endpoint (observed in m18/AWS) returns 400 even though the delete completes successfully. If you run it again, it correctly returns 404.

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/115 · path and sdpath not used consistently, error in /user with parameter path · 2023-09-27 · Zachary Keirn

There are, I think, two issues here. One is documentation: the yaml doc for the /user delete option has 'path' instead of 'sdpath', and I believe it should be 'sdpath'. The other is that when I try to delete a user that does not exist, I get 400 instead of 404 in response.
This is regardless of whether I try 'path' or 'sdpath' for the parameter.

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/114 · Implement dataset storage for IBM · 2023-09-20 · Mark Yan

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/113 · Implement dataset storage for GCP · 2023-09-20 · Mark Yan

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/112 · [ADR] Synching SDMS V3 datasets in SDMS V4 · 2024-02-28 · Diego Molteni

# Introduction
We need a solution for making datasets ingested in SDMS V3 visible to and consumable by SDMS V4.
The purpose of this ADR is to describe how to enable a synchronization mechanism that allows users of SDMS V4 to consume seismic dataset entities ingested in SDMS V3 via client applications, even though the two versions of the system have entirely different architectural logics.
# Status
* [x] Initiated
* [x] Proposed
* [ ] Under Review
* [ ] Approved
* [ ] Rejected
# Problem statement
The Seismic Data Management Service V4 (SDMS V4) stores and manages data types as defined by the Open Subsurface Data Universe (OSDU) Authority. The APIs (Application Programming Interfaces) provide robust data type checks and are fully integrated with the OSDU policy service. The goal is to minimize ambiguity in the authorization model and facilitate straightforward adoption through a consistent usage pattern. In contrast, the V3 version of the service defines, saves, and manages proprietary metadata records, interacts directly with the entitlement service, and organizes records into collections/data-groups named subprojects.
<div align="center">
<br/><img src="/uploads/5e1a58219ca35be9da530b0eba2ed9fa/arch-diagram.png"
alt="sdms-architectural-diagram"
style="display: block; margin: 0 auto;"/><br/>
</div>
The key difference between the two versions of the service lies in how the cloud storage URI is generated. In SDMS V4 it is generated from the record-id value, while in SDMS V3 the generated URI is a random UUID.
# Proposed solution
Update SDMS V4 by adding the capability to correctly retrieve the storage location for the dataset's bulk data if the dataset was ingested via SDMS V3.
## Scenarios
When a dataset is ingested in SDMS V3 from a seismic application, the latter also creates an OSDU Bulk record linked to a Work Product Component, as shown in the following diagram:
<div align="center">
<br/><img src="/uploads/3d73191098963a80675c2ed6e96472cc/image.png"
alt="sdms-architectural-diagram"
style="display: block; margin: 0 auto; height: 30%; width: 30%" /><br/>
</div>
The seismic application saves the SDMS V3 URI (also known as the `sdpath`) in the `FileSourceInfo` property of the created OSDU Bulk record. This is done to later facilitate communication of the URI to SDMS V3 for retrieving the storage connection string required to access the dataset's bulk data.
### Example of SDMS V3 dataset metadata
```json
{
"name": "test-data.zgy",
"tenant": "partition",
"subproject": "subproject",
"path": "/",
"ltag": "test-legal",
"created_by": "test-user@slb.com",
"last_modified_date": "Tue Sep 12 2023 11:04:29 GMT+0000 (Coordinated Universal Time)",
"created_date": "Tue Sep 12 10:54:10 GMT+0000 (Coordinated Universal Time)",
"gcsurl": "ss-weu-xkz32bjwg2425gn/bdf36c8a-3c62-3151-12b7-227af4727520",
"ctag": "sMTz0oWeId1nOnrx",
"readonly": true,
"sbit": null,
"sbit_count": 0,
"filemetadata": {
"type": "GENERIC",
"size": 1544552448,
"nobjects": 47
},
"seismicmeta_guid": "partition:work-product-component--SeismicTraceData:326bac9a-1fb2-5c73-9c64-6ca122c5025a",
"access_policy": "uniform"
}
```
### Example of OSDU storage associated Work Product Component
```json
{
"id": "partition:work-product-component--SeismicTraceData:326bac9a-1fb2-5c73-9c64-6ca122c5025",
"kind": "osdu:wks:work-product-component--SeismicTraceData:1.3.0",
"version": 1685099234631439,
"acl": {
"viewers": [
"data.test@domain.slb.com"
],
"owners": [
"data.test@domain.com"
]
},
"legal": {
"legaltags": [
"test-legal"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"data": {
"BinGridID": "partition:work-product-component--SeismicBinGrid:2a714f2b12aa346d16a08c5a2f4e157e:",
"Datasets": [
"partition:dataset--FileCollection.Slb.OpenZGY:1de532c2-4d1b-5316-ba4a-422342321d55"
],
"DDMSDatasets": [
"urn:dataset--FileCollection.Slb.OpenZGY:1de532c2-4d1b-5316-ba4a-422342321d55"
],
"Name": "test-data.zgy",
"Source": "osdu",
"SubmitterName": "test-user@domain.com"
},
"createUser": "test-user@domain.com",
"createTime": "2023-09-12T11:04:30.321Z",
"modifyUser": "test-user@domain.com",
"modifyTime": "2023-09-12T18:09:12.703Z"
}
```
### Example of OSDU storage associated File Collection
```json
{
"id": "partition:dataset--FileCollection.Slb.OpenZGY:1de532c2-4d1b-5316-ba4a-422342321d55",
"version": "4426199321664216",
"kind": "osdu:wks:dataset--FileCollection.Slb.OpenZGY:1.0.0",
"acl": {
"viewers": [
"data.test@domain.slb.com"
],
"owners": [
"data.test@domain.com"
]
},
"legal": {
"legaltags": [
"test-legal"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"createUser": "test-user@domain.com",
"createTime": "2023-09-12T11:04:02.705Z",
"data": {
"Endian": "BIG",
"SEGYRevision": "rev 1",
"TotalSize": "1544552448",
"Name": "test-data.zgy",
"DatasetProperties": {
"FileCollectionPath": "sd://tenant/subproject/",
"FileSourceInfos": [
{
"FileSource": "test-data.zgy",
"Name": "test-data.zgy",
"FileSize": "1544552448",
}
]
}
}
}
```
## Proposed Solution
To enable applications to access bulk datasets ingested in SDMS V3 through SDMS V4, we need to update the mechanism in SDMS V4 for retrieving the correct storage URI associated with the Bulk record. This update is necessary to generate a valid connection string for accessing the bulk data.
When a Bulk record is created, the SDMS V3 URI (also known as the 'sdpath') is typically saved in the `FileCollectionPath` and `FileSource` properties. In the most common scenarios, the `sd://tenant/subproject/path` portion of the URI is stored in the `FileCollectionPath` property, while the URI's name is stored in the `FileSource` property.
When a connection access string is requested for a Bulk record through SDMS V4, the service should detect whether the record's file source refers to a V3 dataset's URI. If it does, the service should then:
1. extract the `subproject` name from the `FileCollectionPath`
```python
subproject = record.data.DatasetProperties.FileCollectionPath.replace("sd://", "").split('/')[1]
```
2. extract the `path` from the `FileCollectionPath`
```python
path = '/'.join(record.data.DatasetProperties.FileCollectionPath.replace("sd://", "").split('/')[2:])
```
3. extract the `name` from the `FileSource`
```python
name = record.data.DatasetProperties.FileSourceInfos[0].FileSource
```
4. retrieve the storage URL from the V3 journal
```sql
SELECT c.data.gcsurl
FROM c
WHERE
c.data.subproject="{subproject}"
AND c.data.path="{path}"
AND c.data.name="{name}"
```
5. generate the connection string using the retrieved storage URL
```python
storage_client = StorageClient("{storage-url}")
return storage_client.getConnectionString()
```
#### Notes
Seismic applications use different approaches to save the SDMS V3 URI in the Bulk record, and all these cases should be considered:
1. The sd://tenant/subproject/path is saved in the `FileCollectionPath`, and the name is saved in `FileSource`.
2. The full sd://tenant/subproject/path/name URI is saved in both `FileCollectionPath` and `FileSource`.
3. The sd://tenant/subproject/path URI is saved in `FileCollectionPath`, and the name in `FileSource`, but the latter starts with the `./` prefix (which should be removed).
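The three layouts can be normalized with a single helper; the following is an illustrative sketch (the function name and the exact normalization rules are assumptions based on the cases listed above; requires Python 3.9+ for `str.removeprefix`):

```python
def parse_v3_source(file_collection_path: str, file_source: str):
    """Normalize the SDMS V3 URI parts stored in an OSDU Bulk record.

    Handles the three observed layouts:
    1. FileCollectionPath = sd://tenant/subproject/path, FileSource = name
    2. both fields hold the full sd://tenant/subproject/path/name URI
    3. like 1, but FileSource is prefixed with './'
    """
    # name: strip the optional './' prefix (case 3); if FileSource holds
    # the full URI (case 2), keep only the last segment
    name = file_source.removeprefix("./")
    if name.startswith("sd://"):
        name = name.rstrip("/").split("/")[-1]
    parts = file_collection_path.replace("sd://", "").rstrip("/").split("/")
    # case 2: FileCollectionPath also ends with the dataset name; drop it
    if len(parts) > 2 and parts[-1] == name:
        parts = parts[:-1]
    tenant, subproject = parts[0], parts[1]
    path = "/" + "/".join(parts[2:])
    return tenant, subproject, path, name
```

The returned `(tenant, subproject, path, name)` tuple can then feed the journal query from step 4.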
### Limitations
Applications that do not match the described flow should be reviewed with the application owner before defining the right strategy to enable the synchronization of datasets ingested in SDMS V3 with SDMS V4.

Milestone: M22 - Release 0.25
Assignees: Sacha Brants, Sneha Poddar

---
https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/issues/111 · [ADR] Synching SDMS V4 datasets in SDMS V3 · 2023-09-29 · Diego Molteni

# Introduction
We need a solution for making datasets ingested in SDMS V4 visible to and consumable by SDMS V3.
The purpose of this ADR is to describe how to enable a synchronization mechanism that allows users of SDMS V3 to consume seismic dataset entities ingested in SDMS V4, even though the two versions of the system have entirely different architectural logics.
# Status
* [x] Initiated
* [x] Proposed
* [ ] Under Review
* [ ] Approved
* [ ] Rejected
# Problem statement
The Seismic Data Management Service V4 (SDMS V4) stores and manages data types as defined by the Open Subsurface Data Universe (OSDU) Authority. The APIs (Application Programming Interfaces) provide robust data type checks and are fully integrated with the OSDU policy service. The goal is to minimize ambiguity in the authorization model and facilitate straightforward adoption through a consistent usage pattern. In contrast, the V3 version of the service defines, saves, and manages proprietary metadata records, interacts directly with the entitlement service, and organizes records into collections/data-groups named subprojects.
<div align="center">
<br/><img src="/uploads/5e1a58219ca35be9da530b0eba2ed9fa/arch-diagram.png"
alt="sdms-architectural-diagram"
style="display: block; margin: 0 auto" /><br/>
</div>
The key difference between the two versions of the service lies in the form of the record. The OSDU record adopted by SDMS V4 is entirely managed by the storage service, whereas the V3 metadata has its own format; to locate a dataset ingested in SDMS V4 via V3, it is necessary to create a V3 proprietary record. The following section describes how an OSDU record can be translated into a V3 record to enable the synchronization process between the systems.
# Proposed solution
Create a new service capable of detecting when a new dataset is registered in SDMS V4 and creating the corresponding record in SDMS V3.
## Overview
As previously noted, in SDMS V3 the dataset descriptor has a proprietary structure and is maintained in an internal catalog, while in SDMS V4 the descriptor is a standard OSDU record managed by the storage service. To make datasets ingested in SDMS V4 visible in SDMS V3, we must create corresponding V3 metadata. This section describes how an SDMS V3 record can be created, using the OSDU record details, to make a dataset ingested in V4 visible in V3.
### The SDMS V3 dataset descriptor
```json
{
"id": "the record id <used as key in the service journal catalogue>",
"data": {
"name": "the dataset name",
"tenant": "the tenant name",
"subproject": "the subproject name",
"path": "the dataset virtual folder path",
"acls": {
"admins": "list of entitlement groups with admin rights",
"viewers": "list of entitlement groups with viewer rights"
},
"ltag": "the associated legal tag",
"created_by": "the id of the user who ingested the dataset",
"created_date": "the date and time when the dataset was ingested",
"last_modified_date": "the date and time when the dataset was last modified",
"gcsurl": "the storage uri string where bulks are saved",
"ctag": "a coherency hash tag that changes every time this record is modified",
"readonly": "the access mode level",
"filemetadata": {
"nobjects": "the number of blobs composing the dataset",
"size": "the dataset bulk total size",
"type": "the type of the manifest",
"checksum": "the dataset bulk checksum",
"tier_class": "the dataset storage tier class"
},
"computed_size": "the computed dataset size",
"computed_size_date": "the date and time when the dataset size was computed",
"seismicmeta_guid": "the associated OSDU record id"
}
}
```
### The SDMS V4 record (simplified)
```json
{
"kind": "the osdu dataset kind",
"acl": {
"viewers": "list of entitlement groups with viewer rights",
"owners": "list of entitlement groups with admin rights"
},
"legal": {
"legaltags": "the list of legal tags",
"otherRelevantDataCountries": "the list of data countries",
"status": "the legal status"
},
"data": {
"Name": "the dataset name",
"Description": "the dataset description",
"TotalSize": "the dataset total size",
"DatasetProperties": {
"FileCollectionPath": "the dataset virtual folder path",
"FileSourceInfos": [
{
"FileSource": "the file component source",
"PreloadFilePath": "the file component origin",
"Name": "the file component name",
"FileSize": "the file component size",
"Checksum": "the file component checksum",
"ChecksumAlgorithm": "the checksum algorithm"
}
],
"Checksum": "the dataset checksum"
}
}
}
```
### ADR symbols definitions
To make it simpler for the reader to understand the examples in the following sections, we define the following symbols:
| Symbols | Description |
| --- | --- |
| RV3 | the SDMS V3 record |
| RV4 | the SDMS V4 record |
| RV4.DatasetProperties | the record_v4.data.DatasetProperties element |
| RV4.FileSourceInfos | the record_v4.data.DatasetProperties.FileSourceInfos element |
### The SDMS V3 record generation in detail
- `RV3.id`
The ID in SDMS V3 is autogenerated based on the values composing the SDMS V3 URI: `tenant`, `subproject`, `path` and `name`.
```python
import hashlib

# hash only the dataset path and name; tenant and subproject stay readable
hash_obj = hashlib.sha512()
hash_obj.update((RV3.data.path + RV3.data.name).encode('utf-8'))
hashed_value = hash_obj.hexdigest()
RV3.id = 'ds-' + RV3.data.tenant + '-' + RV3.data.subproject + '-' + hashed_value
```
- `RV3.data.name`
The dataset name.
```python
if 'Name' in RV4.data:
    RV3.data.name = RV4.data.Name
elif len(RV4.FileSourceInfos) == 1 and 'Name' in RV4.FileSourceInfos[0]:
    RV3.data.name = RV4.FileSourceInfos[0].Name
else:
    RV3.data.name = RV4.id
```
- `RV3.data.tenant`
The dataset tenant name matches the data-partition-id in the OSDU model. This information cannot be derived from the V4 record itself, but it is readily available to the syncing process.
```python
RV3.data.tenant = data_partition_id
```
- `RV3.data.subproject`
The dataset resource group name (referred to as subproject in SDMS V3) must exist in SDMS V3 with the `access_policy` property set to `dataset`. Essentially, each partition in SDMS V3 should have a default data group where all SDMS V4 datasets can be collected. This required data group can be automatically created by the syncing process. The name of the data group will default to `syncv4`.
```python
RV3.data.subproject = "syncv4"
```
- `RV3.data.path`
The dataset virtual path represents the logical folder structure in the data group (subproject) where the dataset is stored.
```python
RV3.data.path = RV4.DatasetProperties.FileCollectionPath
```
- `RV3.data.acls`
The Access Control List (ACL) defines the entitlement groups with admin and viewer rights. The mapping is direct, except that the V4 `owners` list is named `admins` in the V3 record; the `viewers` list keeps the same name.
```python
RV3.data.acls.admins = RV4.acl.owners
RV3.data.acls.viewers = RV4.acl.viewers
```
- `RV3.data.ltag`
In SDMS V3, legal tag information is represented by a unique value, whereas in SDMS V4, it is represented as a list. To simplify the record composition, we select the first valid legal tag from the V4 record list. If no valid legal tags are found in the V4 record, we should always set an invalid legal tag in V3. If this is not set, V3 will inherit a valid legal tag from the data group, risking the possibility of a non-accessible record in V4 being addressable in V3.
```python
RV3.data.ltag = None
for tag in RV4.legal.legaltags:
    if isValid(tag):
        RV3.data.ltag = tag
        break
if RV3.data.ltag is None:
    # no valid tag found: deliberately keep an invalid one so the
    # record cannot be addressed through V3
    RV3.data.ltag = RV4.legal.legaltags[0]
```
- `RV3.data.created-by`
The user who created/ingested the dataset.
```python
RV3.data['created-by'] = RV4.createUser
```
- `RV3.data.created_date`
The timestamp when the dataset was created/ingested.
```python
RV3.data.created_date = RV4.createTime
```
- `RV3.data.last_modified_date`
The timestamp when the dataset was last modified.
```python
RV3.data.last_modified_date = RV4.modifyTime
```
- `RV3.data.gcsurl`
The storage ID of the container/bucket where dataset bulk files are stored. This value is automatically generated based on the record ID value.
```python
import hashlib

hash_obj = hashlib.sha256()
hash_obj.update(RV4.id.encode('utf-8'))
RV3.data.gcsurl = hash_obj.hexdigest()[:-1]  # 63-character storage id
```
- `RV3.data.ctag`
The Coherency Tag (ctag) is a hash code associated with the dataset descriptor that changes every time the metadata is updated. This property exists only in SDMS V3, and it is autogenerated.
```python
import secrets
import string

alphabet = string.ascii_letters + string.digits
RV3.data.ctag = ''.join(secrets.choice(alphabet) for _ in range(16))
```
- `RV3.data.readonly`
The `readonly` property defines the dataset's status regarding readability. If set to `false`, the dataset can be accessed in both read and write modes. If set to `true`, the dataset can only be accessed in read mode. In SDMS V4, a dataset cannot be marked as `readonly`, and for this reason, in the generated V3 record, the value will be defaulted to `false`.
```python
RV3.data.readonly = False
```
- `RV3.data.filemetadata`
The `filemetadata` object, also known as the dataset manifest, contains information about how the dataset's bulk data is stored in the cloud storage resource. The only manifest type supported in SDMS V3 is `GENERIC`, which requires that all objects composing the dataset be saved in sequential order using the `0` to `N-1` naming convention, where `N` is the number of objects. The fields composing the dataset manifest are:
- `nobjects`: the number of objects composing the dataset, computed by counting them.
- `size`: the dataset total size, computed by summing the sizes of all composing objects. Alternatively, `RV4.data.TotalSize` can be used when present, but computing the value gives a more reliable result.
- `type`: the manifest type; `GENERIC` is the only supported value.
- `checksum`: the dataset checksum.
- `tier_class`: the dataset storage tiering class.
```python
blob_list = getBlobClient(connectionString)
size = 0
tier_class = None
objects_num = 0
error = False
for blob in blob_list:
    # enforce the GENERIC sequential naming convention (0 to N-1)
    if blob.name != str(objects_num):
        error = True
    if tier_class is None:
        tier_class = blob.blob_tier
    objects_num += 1
    size += blob.size
if not error:
    RV3.data.filemetadata.type = 'GENERIC'
    RV3.data.filemetadata.nobjects = objects_num
    RV3.data.filemetadata.size = size
    if 'Checksum' in RV4.DatasetProperties:
        RV3.data.filemetadata.checksum = RV4.DatasetProperties.Checksum
    RV3.data.filemetadata.tier_class = tier_class
else:
RV3.data.filemetadata = None
```
- `RV3.data.computed_size`
The `computed_size` is generated by SDMS V3 when the `/size` endpoint is triggered. This endpoint calculates the size of a dataset by summing the sizes of all composing objects. The field was introduced because the dataset `filemetadata` object is optional and created by client applications, such as sdapi or sdutil, so it can only be trusted by those applications.
```python
blob_list = getBlobClient(connectionString)
size = 0
for blob in blob_list:
size = size + blob.size
RV3.data.computed_size = size
```
- `RV3.data.computed_size_date`
This is the timestamp of when the dataset size has been computed by SDMS V3.
```python
import datetime

RV3.data.computed_size_date = str(datetime.datetime.now())
```
- `RV3.data.seismicmeta_guid`
The `seismicmeta_guid` is the ID of a record linked with the SDMS V3 dataset. Setting it to the SDMS V4 record ID lets consumer applications retrieve all extra properties of the dataset.
```python
RV3.data.seismicmeta_guid = RV4.id
```
### The Script to validate the proposed conversion
- The script [sync-script.py](/uploads/2421d4b04fe2a6fdd560f1df321e5d36/sync-script.py) is provided with this ADR (for testing purposes only) to demonstrate and validate the syncing flow between SDMS V4 and V3:
- Create a random 16 MB data file and compute its checksum
- Fill an OSDU record and register it in SDMS V4
- Upload the 16 MB file as 4 objects of 4 MB each using the connection string generated via SDMS V4
- Generate a V3 metadata record and register it in SDMS V3
- Ensure the dataset in SDMS V3 can be located after ingestion
- Download all objects using the connection string generated via SDMS V3
- Compare the initial file with the downloaded one to ensure they match
#### Example of an SDMS V4 ingested record
```json
{
"id": "opendes:dataset--FileCollection.SEGY:7fe06451787641c4953a06a63e44967a",
"kind": "osdu:wks:dataset--FileCollection.SEGY: 1.1.0",
"version": 1694519237996696,
"acl": {
"viewers": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.viewer@opendes.domain.com"
],
"owners": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.admin@opendes.domain.com"
]
},
"legal": {
"legaltags": [
"ltag-seistore-test-01"
],
"otherRelevantDataCountries": [
"US"
],
"status": "compliant"
},
"modifyUser": "test-user@domain.com",
"modifyTime": "2023-09-07T11:47:18.625Z",
"createUser": "test-user@domain.com",
"createTime": "2023-09-07T07:17:58.443Z",
"data": {
"Name": "data-sync.segy",
"TotalSize": "16777216",
"Description": "SDMS synching test record",
"DatasetProperties": {
"FileCollectionPath": "/f1/f2/f3/",
"FileSourceInfos": [
{
"FileSource": "data-sync.segy",
"Name": "data-sync.segy",
"FileSize": "16777216",
"Checksum": "8ce2025f9b27e3017ab15f15b261d599",
"ChecksumAlgorithm": "MD5"
}
],
"Checksum": "8ce2025f9b27e3017ab15f15b261d599"
}
}
}
```
#### Example of a generated SDMS V3 metadata
```json
{
"id": "ds-opendes-syncv4-c0699ac77bc64a5772ac7f6f455ce5a251e3686d87d26e91df2ecc73e7bfdf4b0a16ac757c2ec227c1a6814d097a0b6b759a01dc52753754a0a18dfaea53c7d0",
"data": {
"name": "data-sync.segy",
"tenant": "opendes",
"subproject": "syncv4",
"path": "/f1/f2/f3/",
"acls": {
"admins": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.admin@opendes.domain.com"
],
"viewers": [
"data.sdms.opendes.tdata.fe6730f9-bb3d-46a3-9f03-3d529e32360d.viewer@opendes.domain.com"
]
},
"ltag": "ltag-seistore-test-01",
"created-by": "test-user@domain.com",
"created_date": "2023-09-07T07:17:58.443Z",
"last_modified_date": "2023-09-07T11:47:18.625Z",
"gcsurl": "a5993feef91df715c176452fe1a26d04ca70e88d0ccff268e92cd74c76dde61",
"ctag": "9STTAfiKl4iukKbp",
"readonly": "false",
"filemetadata": {
"nobjects": 4,
"size": 16777216,
"type": "GENERIC",
"checksum": "8ce2025f9b27e3017ab15f15b261d599",
"tier_class": "Hot"
},
"computed_size": 16777216,
"computed_size_date": "2023-09-12 13:47:45.877142",
"seismicmeta_guid": "opendes:dataset--FileCollection.SEGY:7fe06451787641c4953a06a63e44967a"
}
}
```
### SDMS V4 to V3 Syncing Automation
The preceding section explains the process of creating a metadata descriptor for SDMS V3 using an OSDU record. This metadata descriptor enables access to a dataset ingested in SDMS V4 through SDMS V3.
In order to automate the process, we will deploy a new service called the `sdms-sync-service`, which will be responsible for generating an SDMS V3 record every time a new dataset is registered in SDMS V4. When a dataset is registered in SDMS V4, a message will be pushed into a Redis queue as `insert-synch-v4:{record-id}:{partition}:{other-required-params}`. The new service will consume the messages from the Redis queue and initiate the syncing process:
- retrieve the OSDU record from the storage service
- generate the corresponding SDMS V3 metadata descriptor
- save the generated metadata in the SDMS V3 journal
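The push/consume flow above can be sketched as follows. This is a simulation using an in-memory queue rather than the actual Redis client; `retrieve_record`, `generate_v3_descriptor`, and `save_to_journal` are hypothetical stand-ins for the storage-service call, the conversion described earlier, and the journal write:

```python
from collections import deque

# In-memory stand-in for the Redis queue used by the sdms-sync-service.
# The real message format is `insert-synch-v4:{record-id}:{partition}:...`
# with additional required parameters appended (omitted in this sketch).
queue = deque()

def on_dataset_registered(record_id, partition):
    # SDMS V4 side: push a message when a new dataset is registered
    queue.append(f"insert-synch-v4:{record_id}:{partition}")

def sync_worker(retrieve_record, generate_v3_descriptor, save_to_journal):
    # Sync-service side: consume messages and run the three steps
    while queue:
        message = queue.popleft()
        body = message[len("insert-synch-v4:"):]
        # the OSDU record id itself contains colons, so peel the
        # partition off the end (assumes the partition name has none)
        record_id, partition = body.rsplit(":", 1)
        record = retrieve_record(record_id, partition)
        descriptor = generate_v3_descriptor(record, partition)
        save_to_journal(descriptor)
```

A production worker would block on the Redis queue (e.g. with `BRPOP`) instead of draining an in-memory deque.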
<div align="center">
<br/><img src="/uploads/b2d6eb24b28516feb0908e5ef7232a2e/sdms-sync-service.png"
alt="sdms-sync-service"
style="display: block; margin: 0 auto" /><br/>
</div>
### Details
- If a dataset is patched in SDMS V4, the service should push an `insert` message into the Redis queue:
- If the previous `insert` message is still in the queue (not yet consumed by the sync service), the existing entry will be overwritten in the queue, and the sync service will create the updated one.
- If the previous version was already synced, when the new message is consumed, the updated record will be created, and because the generated key is identical, it will overwrite the existing record in the journal.
- If a dataset is deleted in SDMS V4, the service should push a `delete` message into the Redis queue.
- When the delete message is consumed, the sync service will generate only the V3 record key and remove the entry from the journal.
- If the `insert` message has not yet been consumed, the sync service should check, when consuming it, whether a `delete` message is also present for the same record. If one is found in the queue, the sync service will skip the sync process and remove both the `insert` and `delete` entries from the Redis queue.
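A minimal sketch of this skip logic, simulating the queue as a list of `(action, record_id)` tuples; `sync_record` and `delete_record` are hypothetical handlers:

```python
def process_queue(messages, sync_record, delete_record):
    # Consume insert/delete messages; an insert whose record also has a
    # pending delete is skipped, and both entries are dropped.
    pending = list(messages)
    while pending:
        action, record_id = pending.pop(0)
        if action == "insert":
            if ("delete", record_id) in pending:
                # matching delete found: skip the sync, drop both entries
                pending.remove(("delete", record_id))
                continue
            sync_record(record_id)
        elif action == "delete":
            # only the V3 record key is needed to remove the journal entry
            delete_record(record_id)
```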
### Limitations
When a dataset is registered in V4 via a client app, the record is created instantaneously, while uploading the bulk data into the storage resource takes longer. If the `insert` message is consumed before the bulk data is uploaded, the file manifest cannot be computed because objects are missing. To address this, we can enable a background process in the `sync-service` that loops over the created SDMS V3 records and updates the manifest when it does not exist, or when the last modified time in the corresponding SDMS V4 record is greater than the one reported in the V3 entry. This approach should be re-discussed with the community to find an optimal strategy.

---

# Unsupported Feature in Dataset LS Get Endpoint Causing Test Failures on AWS and Anthos

A new feature has been introduced for the dataset LS get endpoint, comprising the Search (to select a single SQL-like search parameter) and Select (to choose multiple fields for retrieval) query parameters. The API is expected to return a list of datasets based on the search and select query parameters. However, AWS and Anthos do not support this new feature for this endpoint, leading to test failures during pipeline runs.
Pipeline runs:
AWS: https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/jobs/2200880
Anthos: https://community.opengroup.org/osdu/platform/domain-data-mgmt-services/seismic/seismic-dms-suite/seismic-store-service/-/jobs/2200882

---

# [ADR] Hierarchical data distribution statistics based on path - API endpoint

# Introduction
We need a solution for retrieving dataset statistics, which currently consist only of dataset sizes.
The purpose of this ADR is to define the approach for retrieving the hierarchical data distribution statistics based on a path.
# Status
* [x] Initiated
* [x] Proposed
* [x] Under Review
* [ ] Approved
* [ ] Rejected
# Problem statement
The SDMS API currently exposes the following endpoints for managing dataset sizes:
- `POST /dataset/tenant/{tenantid}/subproject/{subprojectid}/dataset/{datasetid}/size` - computes the actual dataset size and updates the dataset metadata `computed_size` field.
- (deprecated) `GET /dataset/tenant/{tenantid}/subproject/{subprojectid}/sizes` - fetches the sizes of the datasets based on the metadata field `filemetadata.size`.
# Proposed solution
Create a new API endpoint for retrieving the total size of a dataset, a subfolder, or a subproject. The new endpoint will require the _viewer_ or _admin_ role.
## Overview
```bash
GET /dataset/tenant/{tenant}/subproject/{subproject}/size?path={path}&datasetid={datasetname}
```
Path parameters:
- **tenant** - tenant
- **subproject** - subproject
Query parameters:
- **path** - folder path for which the analytics are going to be retrieved [mandatory if query parameter `{datasetid}` is specified]
- **datasetid** - dataset name for which the analytics are going to be retrieved
Response:
HTTP 200
```json
{
"dataset_count": 9999,
"size_bytes": 1024
}
```
- **dataset_count** - number of datasets under a specific subproject/folder
- **size_bytes** - sum of sizes [B] of all datasets under a specific subproject/folder or for a specific dataset
### Examples:
- `GET /dataset/tenant/tenant1/subproject/subproject1/size` - fetch and sum sizes of all datasets in the `subproject1`
- `GET /dataset/tenant/tenant1/subproject/subproject1/size?path=folderA/folderB` - fetch and sum sizes of all datasets under the folder path `folderA/folderB` in subproject `subproject1`
- `GET /dataset/tenant/tenant1/subproject/subproject1/size?path=folderA/folderB&datasetid=file.txt` - fetch the size of a dataset with the name `file.txt` that resides under the folder path `folderA/folderB` in subproject `subproject1`
## Details
Currently, two fields in the dataset metadata record can store information about the dataset size: `filemetadata.size` and `computed_size`. `filemetadata.size` is being used by the SDK on the client side, `computed_size` is intended to be computed and ingested on the server side.
To make sure the chosen field can be a reliable source of truth, the API endpoint implementation will calculate the sum of dataset sizes based on the `computed_size` field.
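A sketch of the proposed aggregation, assuming the metadata records expose `path`, `name`, and `computed_size` fields (a hypothetical helper, not the actual implementation):

```python
def get_size_stats(records, path=None, datasetid=None):
    # Sum `computed_size` over the metadata records matching the optional
    # folder path (prefix match) and dataset name filters, mirroring the
    # response schema of the proposed size endpoint.
    selected = records
    if path is not None:
        prefix = "/" + path.strip("/") + "/"
        selected = [r for r in selected
                    if ("/" + r["path"].strip("/") + "/").startswith(prefix)]
    if datasetid is not None:
        selected = [r for r in selected if r["name"] == datasetid]
    return {
        "dataset_count": len(selected),
        # datasets without a computed size contribute 0 (see limitations)
        "size_bytes": sum(r.get("computed_size", 0) for r in selected),
    }
```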
# Out of scope / limitations
A challenge with using `computed_size` field as a source of truth is that some datasets may not have this property calculated, as currently the only way to update this value is by manually calling the `Compute Size` POST endpoint.
The solution to ensure the reliability of the value of the `computed_size` field will be the subject of a separate ADR.

---

# [ADR] Hierarchical deletion of datasets

# Introduction
We need a way to delete millions of datasets (including metadata and files in blob storage) in Seismic DMS. A single delete operation can include up to 50 million datasets.
The purpose of this ADR is to define the approach to implementing a hierarchical delete feature in SDMS.
# Status
* [x] Initiated
* [x] Proposed
* [x] Under Review
* [ ] Approved
* [ ] Rejected
# Problem statement
SDMS API currently exposes the following endpoints for deleting datasets:
- `DELETE /dataset/tenant/{tenantid}/subproject/{subprojectid}/dataset/{datasetid}`
Deletes a single dataset.
- `DELETE /subproject/tenant/{tenantid}/subproject/{subprojectid}`
Deletes a subproject.
The subproject deletion endpoint currently does not scale to the required number of datasets. The current implementation also leaves the possibility of an inconsistent state between the metadata and the files in blob storage: if some files fail to be deleted, the deletion of the metadata associated with those datasets is not reverted.
SDMS currently does not have the functionality of deleting only selected datasets in a subproject, filtered by path, tags, labels, etc.
# Proposed Solution
In short:
- Create new API endpoints to support starting and tracking progress of the asynchronous deletion operation.
- Deploy a new service on k8s that would asynchronously delete datasets.
## Overview
We will introduce the bulk-delete feature as follows:
1. Implement and deploy a separate application to the same K8s cluster: the _deletion service_.
This service will accept the bulk deletion requests from SDMS API, perform the deletion and keep track of the progress of this long-running operation.
2. Add the new endpoint to SDMS API to delete all datasets in a specified path:
`PUT /operations/bulk-delete?sdpath={sdpath}`
Status: 202 Accepted
`sdpath` in the format `sd://tenant/subproject/path`
Response schema:
```json
{
"operationId": "{string}"
}
```
3. Add the new endpoint to SDMS API to view the status and progress of the delete operation:
`GET /operations/bulk-delete/status/{operationid}`
Status: 200 OK
Response schema:
```json
{
"OperationId": "{string}",
"CreatedAt": "{string}",
"CreatedBy": "{string}",
"LastUpdatedAt": "{string}",
"Status": "{string}",
"DatasetsCnt": "{int}",
"DeletedCnt": "{int}",
"FailedCnt": "{int}"
}
```
Headers will contain `data-partition-id` information to check if the user is registered in the partition before retrieving the operation status.
## Details
### Initiating a delete operation
- The new `PUT` endpoint will support the following cases for the dataset path, provided in the `sdpath` parameter:
- `path = /<path>/` - all datasets under the specified path should be deleted.
- path not specified - all datasets in the subproject should be deleted.
If deletion of the subproject itself (metadata and container) is also desired, clients should call the delete subproject endpoint after the bulk delete operation completes; this keeps subproject deletion non-blocking even when the subproject is composed of many datasets.
- The endpoint triggers the deletion job and returns the ID of the initiated operation.
- The delete operation is initiated in SDMS by pushing a message onto a queue (Azure Storage queue in case of Azure implementation; a different queuing mechanism can be used by other CSPs); the message contains the `operationId` and the parameters from the original request (tenant, subproject, path).
### Deletion service
Deletion service is a separate component from SDMS API, deployed to the same K8s cluster. The implementation details of the service can be decided by the individual CSPs. This section describes the proposed implementation for Azure.
The source code of the new component will be contributed to the Sidecar solution in the `seismic-store-service` repository.
The logic of the deletion service will work as follows:
- The service consumes the message from the Azure Storage queue and initiates the deletion process.
- All items (dataset IDs and `gcsurl` which determines the location in blob storage) matching the provided subproject and path are retrieved from Cosmos database.
- For each dataset, the deletion service checks if it is locked.
- If yes, the item is discarded from the delete operation.
- If not, the deletion service locks the dataset. The lock value in this case will contain a string indicating that the dataset is locked for deletion (e.g. WDELETE). This will allow another delete operation to delete the dataset if the deletion failed previously. However, it will prevent deletion of datasets locked with a regular write lock which would indicate that it is being actively used.
- The retrieved items are added to storage which allows the deletion service to keep track of the datasets to delete. In the first version of the implementation, the deletion service will store the retrieved datasets in memory.
In a later phase we are planning to use a persistent storage (e.g. Service Bus queue) to store the items to be deleted. This will allow the service to resume deletion after a restart as well as retry deletion for the datasets where it failed.
- The deletion service leverages existing Redis queues to keep track of the overall deletion operation status and progress.
- The deletion service retrieves and deletes the datasets by checking the store containing items to be deleted. In the first version of the implementation it simply iterates over items stored in memory.
- The datasets are processed in batches; for each batch we retrieve the associated blobs from the storage account using the `gcsurl` property of the metadata.
- The blobs from the current batch are deleted.
- We then delete the metadata documents from Cosmos DB, leaving the ones for which the blob deletion was unsuccessful. We consider that the deletion was successful if the blobs were not found as we assume they were deleted earlier.
- The deletion status is updated in Redis after processing every dataset.
- At the end, the status of a completed operation (with errors or without) is saved in Redis.
- The deletion status should not be deleted at this point so that users can query the operation status after completion.
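The batch loop above can be sketched as follows. This is a simplified in-memory simulation under stated assumptions: `delete_blob` stands in for the storage-account call, `metadata` for the Cosmos DB documents, and `status` for the Redis status entry; lock handling and batching are reduced to their essentials:

```python
def bulk_delete(datasets, delete_blob, metadata, status, batch_size=100):
    # `datasets` is the list of items retrieved from the database, each
    # with an 'id' and a 'gcsurl' (and possibly an existing 'lock').
    status.update({"Status": "In progress", "DatasetsCnt": len(datasets),
                   "DeletedCnt": 0, "FailedCnt": 0})
    for i in range(0, len(datasets), batch_size):
        for ds in datasets[i:i + batch_size]:
            if ds.get("lock") not in (None, "WDELETE"):
                continue  # actively locked elsewhere: discard from this run
            ds["lock"] = "WDELETE"  # mark as locked for deletion
            try:
                delete_blob(ds["gcsurl"])     # blobs are deleted first
                metadata.pop(ds["id"], None)  # then the metadata document
                status["DeletedCnt"] += 1
            except Exception:
                # blob deletion failed: keep the metadata for a retry
                status["FailedCnt"] += 1
    status["Status"] = ("Completed" if status["FailedCnt"] == 0
                        else "Completed with errors")
```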
### Sequence diagram for the deletion operation
![deletion_diagram_osdu](/uploads/b097c46896644e19a7374df96560aabd/deletion_diagram_updated.png)
### Deletion status
The status of delete operations will be saved in Redis.
It will be written by the deletion service (updated with the current progress) and read by SDMS API
(when users request the deletion status).
SDMS API and the deletion service will agree on the naming convention for the key in Redis,
e.g. `deletequeue:status:{operationId}`.
The new `GET` endpoint allowing users to query the status of a delete operation will return the following information:
- **`OperationId`** - ID of the delete operation.
- **`Status`** - Current status of the delete operation; possible values are: 'Not started', 'Started', 'In progress', 'Completed', 'Completed with errors'.
- `CreatedAt` - Timestamp of the creation of the delete operation.
- `CreatedBy` - Entity initiating the delete operation.
- `LastUpdatedAt` - Timestamp of the last status update of the delete operation.
- `DatasetsCnt` - Total number of datasets to be deleted; initially not set, until the enumeration of datasets for deletion is completed.
- `DeletedCnt` - Number of deleted datasets; updated after each dataset processed by the deletion service, after both blobs and metadata are deleted.
- `FailedCnt` - Number of datasets for which the deletion failed; updated after each dataset processed by the deletion service if a failure occurs.
_(only the fields in **bold** are required)_
_(dataset counts will be empty if the status is "not started")_
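The status exchange between the two components can be sketched with a plain dict standing in for Redis, using the proposed key convention (`write_status` and `read_status` are hypothetical helpers, not the actual implementation):

```python
import json

def status_key(operation_id):
    # naming convention agreed between SDMS API and the deletion service
    return f"deletequeue:status:{operation_id}"

def write_status(store, operation_id, status):
    # deletion-service side: serialize and persist the current progress
    store[status_key(operation_id)] = json.dumps(status)

def read_status(store, operation_id):
    # SDMS API side: return the stored status, or None if unknown
    raw = store.get(status_key(operation_id))
    return None if raw is None else json.loads(raw)
```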
### Sequence diagram for the deletion status
![deletion_status_diagram](/uploads/52b27cfb56a9942cf7628e81aeb41eec/deletion_status_diagram.png)
# Out of scope / limitations
- Detailed statistics about datasets which failed to be deleted. In the first phase of implementation the deletion status endpoint will provide aggregated statistics as mentioned in the `Deletion status` section. Users will need to refer to logs to find out which datasets failed to be deleted.
- The bulk-delete feature does not guarantee the operation can continue after a restart of the deletion service. It will be up to the different CSPs to determine if there is retry logic for failed datasets or recovery support built into the service.
- Deleting 'orphan' blobs with missing metadata. Files without metadata containing a matching `gcsurl` will not be deleted as part of the delete operation as metadata is the source of truth for which blobs need to be deleted.
- Identifying blobs belonging to a different dataset but located in the same virtual folder as files of another dataset. Since `gcsurl` carries information about the location of files to be deleted, the delete operation will not be able to detect 'unrelated' files erroneously uploaded with the same virtual folder.
# Consequences
The same bulk deletion API endpoints can be implemented by any CSPs besides Azure.
The status endpoint is not CSP-specific. As long as the bulk delete implementation saves
the job status with the same schema to Redis, the status endpoint will work for any other CSP out of the box.