
ADR: Storage record kind update between versions resulting in data duplication

Storage record kind update between versions is resulting in data duplication in downstream services.

Status

  • Proposed
  • Trialing
  • Under review
  • Approved
  • Retired

Context & Scope

Receiving duplicate record IDs on Search requests is not expected. Duplicates can break client workflows, because clients see the same record twice and cannot tell which version is correct.

The Storage service allows users to change the kind between record versions. This change is not exposed to any downstream service and can result in duplicated data in Search.

For example:

  • User ingests a record with id-1 and kind-1 via storage service
  • Indexer service creates index for kind-1 and indexes the record with id-1
  • User updates record with id-1 and changes the kind to kind-2 via storage service
  • Indexer service creates another index for kind-2 and indexes the record with id-1

This results in the same record being indexed twice. If users search via the Search service, they will find two matches for id-1, one under each kind.
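The sequence above can be reproduced with a minimal sketch. It assumes a hypothetical in-memory stand-in for the per-kind indexes; `index_record` and `search` are illustrative names, not the actual Indexer or Search service APIs.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in: one index per kind, keyed by record id.
indexes = defaultdict(dict)

def index_record(record_id, kind, data):
    """Index the record under its current kind, with no knowledge of earlier kinds."""
    indexes[kind][record_id] = data

def search(record_id):
    """Return every kind whose index contains the record id."""
    return sorted(kind for kind, docs in indexes.items() if record_id in docs)

# The user ingests id-1 with kind-1, then updates it and changes the kind to kind-2.
index_record("id-1", "kind-1", {"version": 1})
index_record("id-1", "kind-2", {"version": 2})

duplicates = search("id-1")
print(duplicates)  # ['kind-1', 'kind-2']
```

Because nothing tells the indexing step about kind-1 when the second version arrives, the stale copy is never removed.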

Tradeoff Analysis

One solution may be to prevent kind changes between record versions in Storage, so that this situation can never happen.

However, this goes against the versioning of a schema. Schemas can change their minor or patch revision without it being considered a breaking change. The data definitions team makes use of this on OSDU schemas.

Therefore it is legitimate for a record instance that references a kind to change the minor or patch version it references and still represent the same instance of an entity. If we enforced re-ingestion and duplication for patch and minor changes to schemas, the same instance of an entity would be declared in multiple versions of the same kind. This would cause an explosion of data in the system, with potentially major cost and performance impact as well as usability issues.

For example, if my record referenced the kind

osdu:mysource:mytype:1.0.0

I should be able to update the record to reference the schema

osdu:mysource:mytype:1.0.1

Similarly, we could then prevent changes to the referenced schema except for a minor or patch update.
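A restriction of that kind would need a check along these lines. This is only a sketch: it assumes kinds always follow the `authority:source:type:major.minor.patch` shape of the examples above, and the function names are hypothetical.

```python
def parse_kind(kind):
    """Split 'authority:source:type:major.minor.patch' into (entity, version)."""
    authority, source, entity_type, version = kind.split(":")
    major, minor, patch = (int(part) for part in version.split("."))
    return (authority, source, entity_type), (major, minor, patch)

def same_entity_and_major(old_kind, new_kind):
    """True when both kinds name the same entity and share a major version,
    i.e. they differ at most in the minor or patch revision."""
    old_entity, old_version = parse_kind(old_kind)
    new_entity, new_version = parse_kind(new_kind)
    return old_entity == new_entity and old_version[0] == new_version[0]

patch_bump = same_entity_and_major("osdu:mysource:mytype:1.0.0", "osdu:mysource:mytype:1.0.1")
major_bump = same_entity_and_major("osdu:mysource:mytype:1.0.0", "osdu:mysource:mytype:2.0.0")
print(patch_bump, major_bump)  # True False
```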

However, it is fairly common for an ingestion to reference the wrong schema in a record. In this scenario we could ask the user to delete all affected records and re-ingest them with the new kind referenced. This is a heavyweight operation when it involves millions of records.

Since changing the kind between versions is already a supported workflow, it makes sense to allow users to change the kind in general, so that they can more easily re-assign it in an error scenario.

A different solution could be for the Indexer service to check whether the record exists anywhere and delete it before indexing the new record. However, this service creates a separate Elastic index per kind, so searching all indexes for a single record is expensive. Doing this for every record indexed would add a large burden on the system, resulting either in performance and scalability issues or in higher cost from provisioning more resources for the Elastic indexes.

Therefore a solution that gives the Indexer enough information to find the specific ID it needs to delete in this scenario makes sense, preserving both performance and correctness.
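The difference between the two approaches can be illustrated by counting how many per-kind indexes each one has to touch. The sketch below uses plain dictionaries as a stand-in for Elastic indexes, and the figure of 1000 kinds is an arbitrary assumption chosen for illustration.

```python
def delete_by_scanning(indexes, record_id):
    """Probe every per-kind index for the id; returns how many indexes were probed."""
    probed = 0
    for docs in indexes.values():
        probed += 1
        docs.pop(record_id, None)
    return probed

def delete_from_known_kind(indexes, previous_kind, record_id):
    """Probe only the index named by the previous kind; returns how many were probed."""
    indexes.get(previous_kind, {}).pop(record_id, None)
    return 1

def make_indexes():
    """Hypothetical deployment with 1000 kinds; the stale copy of id-1 sits in kind-7."""
    indexes = {f"kind-{n}": {} for n in range(1000)}
    indexes["kind-7"]["id-1"] = {"version": 1}
    return indexes

scan_cost = delete_by_scanning(make_indexes(), "id-1")
targeted = make_indexes()
targeted_cost = delete_from_known_kind(targeted, "kind-7", "id-1")
print(scan_cost, targeted_cost)  # 1000 1
```

The scan grows with the number of kinds for every record indexed, while the targeted delete stays constant once the previous kind is known.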

Decision

We can resolve this by updating the record change event. Downstream services must be updated to consume these updated events as well.

Currently the record changed event is structured as below

{
  "id":"opendes:Wellbore:id1",
  "kind":"opendes:wks:master-data--Wellbore:1.0.0",
  "op":"create"
}

We would like to

  • Change the event op from create to update when the record is being updated. Indexer already supports the 'update' operation, as it exists in the Storage service today. However, it is a bug that the Storage service sends the 'create' op even when a record is actually being updated. We propose fixing that here.
  • Introduce a new attribute previousVersionsKind to show the kind on the previous version of the record

update event without kind change

{
  "id":"opendes:Wellbore:id1",
  "kind":"opendes:wks:master-data--Wellbore:1.0.0",
  "op":"update",
  "previousVersionsKind": "opendes:wks:master-data--Wellbore:1.0.0"
}

update event with kind changed

{
  "id":"opendes:Wellbore:id1",
  "kind":"opendes:wks:master-data--Wellbore:1.0.1",
  "op":"update",
  "previousVersionsKind":"opendes:wks:master-data--Wellbore:1.0.0"
}
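A minimal sketch of how the producing side could assemble these events follows. The function name and the convention that `previous_kind=None` means "no earlier version exists" are assumptions made for illustration, not the Storage service's actual code.

```python
import json

def build_record_changed_event(record_id, kind, previous_kind=None):
    """Build a record changed event in the proposed shape.

    previous_kind is the kind of the prior record version, or None
    when the record is being created for the first time.
    """
    if previous_kind is None:
        return {"id": record_id, "kind": kind, "op": "create"}
    return {
        "id": record_id,
        "kind": kind,
        "op": "update",
        "previousVersionsKind": previous_kind,
    }

created = build_record_changed_event(
    "opendes:Wellbore:id1", "opendes:wks:master-data--Wellbore:1.0.0"
)
updated = build_record_changed_event(
    "opendes:Wellbore:id1",
    "opendes:wks:master-data--Wellbore:1.0.1",
    previous_kind="opendes:wks:master-data--Wellbore:1.0.0",
)
print(json.dumps(updated, indent=2))
```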

Consequences

  • Record change event needs to send previousVersionsKind property (common logic change)
  • Record changed event needs to send Update op (common logic change)
  • Register service and accompanying documentation need to update the example record changed event to reflect the new property in the list topics API (common logic change)
  • Storage Patch API should support updating the kind field on Storage records
  • Indexer needs to check if the kind has changed between updates and delete the previous instance if it has (common logic change)
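The Indexer-side consequence could look roughly like the sketch below. It assumes the proposed event shape and uses dictionaries in place of Elastic indexes; `handle_record_changed` is a hypothetical name, not an existing Indexer function.

```python
def handle_record_changed(indexes, event, document):
    """Index the document under the event's kind, first removing the copy
    left under the previous kind when the kind changed between versions."""
    previous_kind = event.get("previousVersionsKind")
    if event["op"] == "update" and previous_kind and previous_kind != event["kind"]:
        # Targeted delete: only the index of the previous kind is touched.
        indexes.get(previous_kind, {}).pop(event["id"], None)
    indexes.setdefault(event["kind"], {})[event["id"]] = document

indexes = {}
handle_record_changed(
    indexes,
    {"id": "opendes:Wellbore:id1",
     "kind": "opendes:wks:master-data--Wellbore:1.0.0",
     "op": "create"},
    {"version": 1},
)
handle_record_changed(
    indexes,
    {"id": "opendes:Wellbore:id1",
     "kind": "opendes:wks:master-data--Wellbore:1.0.1",
     "op": "update",
     "previousVersionsKind": "opendes:wks:master-data--Wellbore:1.0.0"},
    {"version": 2},
)

kinds_holding_record = [
    kind for kind, docs in indexes.items() if "opendes:Wellbore:id1" in docs
]
print(kinds_holding_record)  # ['opendes:wks:master-data--Wellbore:1.0.1']
```

An update whose previousVersionsKind equals its kind skips the delete and simply overwrites the document in place.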
Edited Nov 02, 2021 by Alok Joshi