ADR: Namespacing storage records

Background

The OSDU is agreeing on a new EA-level ADR for 'collaborations'. It addresses a broad, wide-ranging problem. More information is available at the EA level here.

At its heart is the idea that data must be separated between the system of record and system of engagement. Today the OSDU only supports the system of record. All data therefore by default resides in the system of record and the APIs we use read, write and delete from the system of record.

In this ADR we are looking at how we can separate data in Storage service into separate namespaces. These namespaces can in the future be linked to a specific collaboration, which will form the system of engagement.

The system of engagement is meant to be used by any application that wants to add or update data in the OSDU. We should therefore know which application is making each request into the system of engagement.

We are starting with storage service as all other changes needed for the system of engagement data separation will be driven by this change.

As shown, the system of engagement can have many namespaces, one for each collaboration.

A single storage record can reside in any number of namespaces. A namespace can also have 0 or many Records.

A storage record consists of 2 parts, the metadata and the data.

{
   id: "opendes:mastered-wellbore:12345678",
   kind: "osdu:wks:mastered-wellbore:1.0.0",
   ...
   ...
   data: {
      ...
      ...
   }
}

Everything inside the 'data' json object shown above is classed as the data and everything else is the 'metadata'.

These are stored separately by the storage service in a 1-many relationship. Every time a Record's data is updated, a new version of that data is created that points to a single metadata instance.

The reference is held directly in the metadata. We can think of the referencing of the data blocks to the metadata like this:

Diagram 1

The latest data version referenced is the 'head' and is returned by default when no version is specified when using the Storage APIs.

Retrieving an older version of the 'data' always returns the same, single version of the metadata.
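The relationship described above can be sketched roughly as follows. This is a minimal illustration, not the Storage service's actual data model: the class name `RecordMetadata`, its fields, and the version values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RecordMetadata:
    """Hypothetical metadata instance; field names are illustrative."""

    id: str
    kind: str
    # References to data versions are held directly in the metadata.
    versions: List[int] = field(default_factory=list)

    def add_version(self, version: int) -> None:
        """Register a new data version created by an update."""
        self.versions.append(version)

    def head(self) -> int:
        """Return the latest referenced data version."""
        return max(self.versions)


metadata = RecordMetadata(
    id="opendes:mastered-wellbore:12345678",
    kind="osdu:wks:mastered-wellbore:1.0.0",
)
# Three updates create three data versions that share one metadata instance.
for version in (1669700000000, 1669700005000, 1669700009000):
    metadata.add_version(version)

latest = metadata.head()
```

Here `latest` is the 'head': the version returned when a request does not specify one.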

With collaboration there is the possibility that many 'heads' exist at the same time, one per collaboration. There can be many collaborations and each collaboration can hold many entities.

Each collaboration should be treated independently. Therefore, any change to a Record in the context of a collaboration should be reflected only in that context and not affect any others.

Out of scope

For this ADR we are looking only at how we separate data in Storage service between the System of Record (what exists today in OSDU) and System of engagement (collaborations).

We are not deciding on:

How DDMS will separate the data
How Consumption services like search separate the data
How data will transfer between the system of Record and system of engagement in Storage
How collaborations will act on this or control this behavior or even what a collaboration entity looks like
Any other service that might need to act on a collaboration context e.g. ingestion

Solution

The suggestion is to create a different instance of the Storage metadata specific to the collaboration context. It is stored using a compound key of the record id + the collaboration id.

This collaboration id forms the namespace for a record, and combining the 2 means we have a unique metadata instance per collaboration.

Therefore if a Record is not assigned to a collaboration the namespace is the same as it is today (empty) and the id remains unchanged. This maintains current system behavior for existing data in the system of record.

Note: The Record ID is never changed between namespaces and should be persisted and returned to the user the same as it is today no matter the context provided. The id of the document/row used in the database should append the namespace value so that multiple metadata instances can coexist for the same Record ID. This means the data model of the metadata needs to have a separate record id and row/document id value.
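A minimal sketch of the split between record id and row/document id might look like this. The class name `MetadataKey`, the example collaboration id `collab-1`, and the choice to append the namespace with no separator are assumptions for illustration; the ADR only states that the namespace value is appended.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetadataKey:
    """Hypothetical key model separating the record id from the row id."""

    record_id: str       # returned to the user, identical in every namespace
    namespace: str = ""  # collaboration id; empty for the system of record

    @property
    def document_id(self) -> str:
        # Assumption: the namespace is appended directly to the record id.
        # An empty namespace leaves the id exactly as it is today.
        return self.record_id + self.namespace


sor_key = MetadataKey("opendes:mastered-wellbore:12345678")
collab_key = MetadataKey("opendes:mastered-wellbore:12345678", "collab-1")
```

Both keys expose the same `record_id`, while their `document_id` values differ, so the two metadata instances can coexist in the same table or collection.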

References to the data are held in each metadata allowing the same data to be referenced by multiple namespaces but also to have unique versions of a record Id to exist in individual namespaces. The reference is also quick and cheap to add/remove from different namespaces.

Diagram 2

Note that multiple collaborations could be active at the same time and the 'data' versions do not have to be linear between them. For example, changes from different collaborations could overlap one another. This is because the version is already defined as an epoch timestamp and so is ordered by when it was created.
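To illustrate how versions from different collaborations can interleave, here is a small sketch under stated assumptions: the in-memory dictionaries, the namespace names, and the small integer timestamps are all hypothetical stand-ins for the real storage layer and real epoch values.

```python
from typing import Dict, List

# Data blocks keyed by version (an epoch timestamp in the real system).
data_versions: Dict[int, dict] = {}
# Version references per namespace; "" stands for the system of record.
references: Dict[str, List[int]] = {"": [], "collab-1": [], "collab-2": []}


def write_version(namespace: str, payload: dict, now: int) -> int:
    """Store a new data version and reference it from one namespace only."""
    data_versions[now] = payload
    references[namespace].append(now)
    return now


v1 = write_version("", {"name": "original"}, now=1000)
# Sharing existing data with a collaboration only adds a reference.
references["collab-1"].append(v1)
references["collab-2"].append(v1)

# Edits from the two collaborations overlap in time.
write_version("collab-1", {"name": "edit A"}, now=2000)
write_version("collab-2", {"name": "edit B"}, now=3000)
write_version("collab-1", {"name": "edit C"}, now=4000)

heads = {ns: max(versions) for ns, versions in references.items()}
```

Each namespace ends up with its own head, and the system of record is untouched by either collaboration.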

Diagram 3

Behavior of retrieval APIs

If we take diagram 3 as the current state of a Record we can look at how different API requests to it should be handled with and without a collaboration context.

Getting latest in collaboration 1

curl -X 'GET' \
  '<osdu>/api/storage/v2/records/<id>' \
  --header 'x-collaboration: id=collaboration 1,application=<app-name>;'

Expected Result: V7 returned

Retrieving version 4 when no collaboration provided

curl -X 'GET' \
  '<osdu>/api/storage/v2/records/<id>/versions/<version4>'

Expected Result: Error, version 4 does not exist

Retrieving version 4 when collaboration 2 provided

curl -X 'GET' \
  '<osdu>/api/storage/v2/records/<id>/versions/<version4>' \
  --header 'x-collaboration: id=collaboration 2,application=<app-name>;'

Expected Result: Error, version 4 does not exist
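The lookup rule behind these expected results can be sketched as below. The state used here is hypothetical and is not the state shown in diagram 3; the function name `resolve_version`, the exception type, and the namespace names are likewise assumptions.

```python
from typing import Dict, List, Optional


class VersionNotFound(Exception):
    """Raised when a namespace does not reference the requested version."""


def resolve_version(
    references: Dict[str, List[int]],
    collaboration_id: Optional[str] = None,
    version: Optional[int] = None,
) -> int:
    """Resolve which data version a request should return."""
    # No collaboration context means the system of record (empty namespace).
    namespace = collaboration_id or ""
    versions = references.get(namespace, [])
    if version is None:
        if not versions:
            raise VersionNotFound(f"no versions in namespace {namespace!r}")
        return max(versions)  # the head of this namespace
    if version not in versions:
        raise VersionNotFound(f"version {version} does not exist")
    return version


# Hypothetical state: version 4 exists only in collaboration "c1".
state = {"": [1, 2], "c1": [1, 2, 4, 7], "c2": [1, 2, 5]}
```

With this state, `resolve_version(state, "c1")` resolves to 7, while asking for version 4 with no collaboration, or with "c2", raises `VersionNotFound`.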

Collaboration context header

The x-collaboration header is an optional HTTP header whose directives instruct the Storage service to handle the request in the context of the provided collaboration instance rather than in the context of the system of record. We are designing it around directives so that it is more extensible over time and can incorporate other elements potentially needed by the collaboration feature set.

NB: In the fullness of time many services will be impacted by the collaboration EA requirements. They could/should re-use this same header to support acting on a specific collaboration context for consistency and usability.

Syntax

Collaboration directives follow the validation rules below:

Directives are case-insensitive but lowercase is recommended
Multiple directives are comma-separated

Request Directives

Directive	Description
id	Mandatory. The ID of the collaboration to handle the request against.
application	Mandatory. The name of the application sending the request.
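A parser for these rules could be sketched as follows. The function name `parse_collaboration_header` is hypothetical, and treating the trailing semicolon as optional is an assumption drawn from the examples in this document rather than from a stated rule.

```python
from typing import Dict

MANDATORY_DIRECTIVES = ("id", "application")


def parse_collaboration_header(value: str) -> Dict[str, str]:
    """Parse an x-collaboration header value into a directive mapping."""
    directives: Dict[str, str] = {}
    # Assumption: a trailing ';' terminates the directive list, as in the examples.
    body = value.strip().rstrip(";")
    for part in body.split(","):
        name, separator, directive_value = part.partition("=")
        if not separator or not name.strip():
            raise ValueError(f"malformed directive: {part!r}")
        # Directive names are case-insensitive, so normalise to lowercase.
        directives[name.strip().lower()] = directive_value.strip()
    missing = [d for d in MANDATORY_DIRECTIVES if not directives.get(d)]
    if missing:
        raise ValueError(f"missing mandatory directives: {missing}")
    return directives


parsed = parse_collaboration_header("ID=abc-123,Application=my-app;")
```

Unknown directives are kept rather than rejected in this sketch, which leaves room for the header to be extended later.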

Examples

Retrieve a specific version of a Record that exists in a collaboration

curl -X 'GET' \
  '<osdu>/api/storage/v2/records/<record-id>/versions/<version>' \
  --header 'data-partition-id: opendes' \
  --header 'authorization: Bearer <JWT>' \
  --header 'Content-Type: application/json' \
  --header 'x-collaboration: id=<collaboration-id>,application=<app-name>;'

Retrieve a specific version of a Record that exists in the system of record

We do not send a collaboration context here because the request is meant to access data from the system of record. This is the same request users make today.

curl -X 'GET' \
  '<osdu>/api/storage/v2/records/<record-id>/versions/<version>' \
  --header 'data-partition-id: opendes' \
  --header 'authorization: Bearer <JWT>' \
  --header 'Content-Type: application/json'

Note that the given record id and version must exist in both the system of record and the collaboration for both API requests to return successfully.

Record changed on namespace

To guarantee that the current system behavior is not changed, we will create a new record changed topic that is triggered only when a record is edited in some way in the context of a collaboration.

This means the existing record changed topic remains unchanged and is triggered only when changes are made in the system of record like they are today.

Downstream listeners can then bind to the new 'Record changed on namespace' topic over time, as and when they want to support the namespace concept.

The new message will also include the extra context information about the namespace. The message will be the same as the current record change message except it will include the new header:

'''
x-collaboration: id=<id>,application=<app-name>; 
'''
...

On top of this the new topic should be exposed through the Notification service so it can be registered to by external consumers as needed.
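The routing between the two topics might be sketched like this. Both topic names and the function `route_record_change` are hypothetical, since the ADR does not name the topics or prescribe an implementation.

```python
from typing import Dict, Optional, Tuple

# Hypothetical topic names; the ADR does not name the topics.
RECORD_CHANGED_TOPIC = "record-changed"
RECORD_CHANGED_NAMESPACE_TOPIC = "record-changed-namespace"


def route_record_change(
    headers: Dict[str, str], collaboration: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Pick the topic and message headers for a record change event."""
    message_headers = dict(headers)
    if collaboration is None:
        # System of record: existing topic, message unchanged.
        return RECORD_CHANGED_TOPIC, message_headers
    # Collaboration context: new topic, same message plus the new header.
    message_headers["x-collaboration"] = collaboration
    return RECORD_CHANGED_NAMESPACE_TOPIC, message_headers


base_headers = {"data-partition-id": "opendes"}
sor_topic, sor_headers = route_record_change(base_headers)
collab_topic, collab_headers = route_record_change(
    base_headers, "id=<id>,application=<app-name>;"
)
```

Existing listeners on the original topic never see collaboration changes, which is how current behavior is preserved.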

Consequences

The Storage service should support a new 'collaboration' header. Any time a collaboration id is provided in this header, the Storage service should act only in that context. This means all Storage APIs must act within the given collaboration context for creation, update, retrieval and deletion of records.

If no header is provided the Storage service should function the same as it does today and no change in behavior should be observed.

In the shared code section we will create a new 'collaboration context' class that is passed into the CSP-specific data layer. This class will hold the collaboration id and application name. Each CSP should combine the collaboration id with the record id to form the primary key of the metadata's data model. In this way the collaboration id forms the namespace of the record id, so multiple metadata instances can exist simultaneously.

We need a new 'Record changed collaboration' message, exposed through the Notification service.

The hard delete API needs to validate all contexts before deleting the blob, as multiple contexts could be referencing the same blob instance.
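The validation for hard delete could be sketched as a reference check across all contexts. The function name `blobs_safe_to_delete`, the in-memory mapping, and the namespace names are assumptions for illustration only.

```python
from typing import Dict, Set


def blobs_safe_to_delete(
    references: Dict[str, Set[int]], namespace: str
) -> Set[int]:
    """Remove one context's metadata and return the blobs nobody else uses.

    `references` maps each namespace of a single record to the data
    versions its metadata instance points at ("" is the system of record).
    """
    removed = references.pop(namespace, set())
    # Collect every version still referenced by the remaining contexts.
    still_referenced: Set[int] = set().union(*references.values())
    return removed - still_referenced


refs = {"": {1, 2}, "collab-1": {2, 3}, "collab-2": {3, 4}}
deletable = blobs_safe_to_delete(refs, "collab-1")
```

In this example nothing can be deleted when `collab-1` is hard deleted, because version 2 is still referenced by the system of record and version 3 by `collab-2`.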

Edited Nov 29, 2022 by ashley kelham