ADR: Storage record kind update between versions resulting in data duplication
Storage record kind update between versions is resulting in data duplication in downstream services.
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
Receiving duplicate record IDs in response to Search requests is not expected. It can break client workflows, because clients see the same record twice and cannot tell which is the correct version.
Storage service allows users to change the kind between record versions. This change is not exposed to any downstream service and can result in this data duplication in search.
For example:
- User ingests a record with id-1 and kind-1 via storage service
- Indexer service creates index for kind-1 and indexes the record with id-1
- User updates record with id-1 and changes the kind to kind-2 via storage service
- Indexer service creates another index for kind-2 and indexes the record with id-1
This results in the same record being indexed twice. A user searching via the Search service will find two matches for id-1, one under each kind.
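The sequence above can be reproduced with a small in-memory model. This is only a sketch: the dictionary stands in for the per-kind indexes, and `index_record` and `search_by_id` are hypothetical names, not Indexer or Search service APIs.

```python
from collections import defaultdict

# Hypothetical stand-in for the per-kind indexes: kind -> {record id -> document}
indexes = defaultdict(dict)

def index_record(record_id, kind):
    """Index a record under its kind without cleaning up any other kind's index."""
    indexes[kind][record_id] = {"id": record_id, "kind": kind}

def search_by_id(record_id):
    """Look for the id in every index, as a search across kinds would."""
    return [docs[record_id] for docs in indexes.values() if record_id in docs]

index_record("id-1", "kind-1")  # initial ingestion
index_record("id-1", "kind-2")  # update that changes the kind

matches = search_by_id("id-1")
print([m["kind"] for m in matches])  # ['kind-1', 'kind-2']
```

Both documents carry the same record ID, which is the duplication a client sees.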
Tradeoff Analysis
One solution may be to prevent kind changes on a record in Storage between record versions, so that this situation can never happen.
However, this goes against the versioning of a schema. Schemas can change the minor or patch revision without considering it a breaking change. The data definitions team makes use of this on OSDU schemas.
Therefore it is legitimate for a record instance to change the minor or patch version of the kind it references and still represent the same instance of an entity. If we enforced re-ingestion and duplication for patch and minor changes to schemas, the result would be an explosion of data in the system, with potentially major cost and performance impact, as well as usability issues, because the same instance of an entity would be declared in multiple versions of the same kind.
e.g. if my record referenced kind osdu:mysource:mytype:1.0.0, I should be able to update the record with the schema osdu:mysource:mytype:1.0.1
Similarly, we could then prevent changes to the referenced schema except on a minor or patch update.
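A minimal sketch of such a restriction, assuming the kind format authority:source:type:major.minor.patch seen in the example above. The function names `parse_kind` and `is_minor_or_patch_change` are hypothetical and are not part of the Storage service.

```python
def parse_kind(kind):
    """Split a kind into (authority, source, entity_type, (major, minor, patch))."""
    authority, source, entity_type, version = kind.split(":")
    major, minor, patch = (int(part) for part in version.split("."))
    return authority, source, entity_type, (major, minor, patch)

def is_minor_or_patch_change(old_kind, new_kind):
    """True when both kinds name the same type and share a major version.

    Identical kinds also return True, since nothing incompatible changed.
    """
    *old_name, old_version = parse_kind(old_kind)
    *new_name, new_version = parse_kind(new_kind)
    return old_name == new_name and old_version[0] == new_version[0]

patch_bump = is_minor_or_patch_change(
    "osdu:mysource:mytype:1.0.0", "osdu:mysource:mytype:1.0.1")
type_change = is_minor_or_patch_change(
    "osdu:mysource:mytype:1.0.0", "osdu:mysource:othertype:1.0.0")
print(patch_bump, type_change)  # True False
```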
However, it is fairly common for an ingestion to reference the wrong schema in a record. In this scenario we could ask the user to delete all affected records and re-ingest them with the correct kind referenced. However, this is a heavyweight operation, considering that it could involve millions of records.
If there is a supported workflow for changing the kind between versions, it makes sense to allow users to change the kind in general, so that they can more easily re-assign it in an error scenario.
A different solution could be for the indexer service to check whether the record exists anywhere and delete it before indexing the new record. However, the indexer creates a different Elasticsearch index per kind, so searching all indexes for a single record is expensive. Doing this for every record indexed would add a large burden on the system, resulting either in performance and scalability issues or in higher cost from provisioning more resources for Elasticsearch.
Therefore a solution that gives the indexer more information, so that it can find the specific record it needs to delete in this scenario, makes sense to preserve both performance and correctness.
Decision
We can resolve this by updating the record change event. Downstream services must be updated to consume these updated events as well.
Currently the record changed event is structured as below
{
"id":"opendes:Wellbore:id1",
"kind":"opendes:wks:master-data--Wellbore:1.0.0",
"op":"create"
}
We would like to:
- Change the event op from create to update when the record is being updated. Indexer already supports the 'update' operation, as it exists in Storage service today. However, it is a bug that storage service sends the 'create' op even when a record is actually being updated. We propose fixing that here.
- Introduce a new attribute previousVersionsKind to show the kind on the previous version of the record
update event without kind change
{
"id":"opendes:Wellbore:id1",
"kind":"opendes:wks:master-data--Wellbore:1.0.0",
"op":"update",
"previousVersionsKind":"opendes:wks:master-data--Wellbore:1.0.0"
}
update event with kind changed
{
"id":"opendes:Wellbore:id1",
"kind":"opendes:wks:master-data--Wellbore:1.0.1",
"op":"update",
"previousVersionsKind":"opendes:wks:master-data--Wellbore:1.0.0"
}
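Events of this shape could be produced by logic along the following lines. This is a sketch under assumptions: `build_record_changed_event` is a hypothetical helper, not the Storage service implementation, and it assumes the kind of the previous record version is known at publish time (None for a new record).

```python
import json

def build_record_changed_event(record_id, kind, previous_kind=None):
    """Build a record changed event; previous_kind is None when the record is new."""
    event = {"id": record_id, "kind": kind}
    if previous_kind is None:
        event["op"] = "create"
    else:
        event["op"] = "update"
        # Sent on every update, even when the kind did not change
        event["previousVersionsKind"] = previous_kind
    return event

event = build_record_changed_event(
    "opendes:Wellbore:id1",
    "opendes:wks:master-data--Wellbore:1.0.1",
    previous_kind="opendes:wks:master-data--Wellbore:1.0.0",
)
print(json.dumps(event, indent=2))
```

Sending the attribute on every update keeps the event shape uniform, so consumers only need to compare the two kind values.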
Consequences
- Record changed event needs to send the previousVersionsKind property (common logic change)
- Record changed event needs to send the update op (common logic change)
- Register service and accompanying documentation need to update the example record changed event in the list topics API to reflect the new property (common logic change)
- Storage Patch API should support updating the kind field on Storage records
- Indexer needs to check if the kind has changed between updates and delete the previous instance if it has (common logic change)
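The indexer-side check in the last item can be sketched as follows. The in-memory `indexes` dictionary and the `handle_record_changed` function are hypothetical stand-ins for illustration, not Indexer service code; a real implementation would issue the delete against the index of the previous kind.

```python
from collections import defaultdict

# Hypothetical stand-in for the per-kind indexes: kind -> {record id -> document}
indexes = defaultdict(dict)

def handle_record_changed(event):
    """Index the record and, on a kind change, delete it from the previous kind's index."""
    record_id, kind = event["id"], event["kind"]
    previous_kind = event.get("previousVersionsKind")
    if event["op"] == "update" and previous_kind and previous_kind != kind:
        # Targeted delete: only the single index named by the event is touched
        indexes[previous_kind].pop(record_id, None)
    indexes[kind][record_id] = {"id": record_id, "kind": kind}

handle_record_changed({"id": "id-1", "kind": "kind-1", "op": "create"})
handle_record_changed({"id": "id-1", "kind": "kind-2", "op": "update",
                       "previousVersionsKind": "kind-1"})

kinds_with_record = [k for k, docs in indexes.items() if "id-1" in docs]
print(kinds_with_record)  # ['kind-2']
```

Because the event names the previous kind, the indexer avoids searching every index for the record ID.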