ADR: Configurable Index Extensions and De-Normalizations
- Status
- Context & Scope
- Tradeoff Analysis
- Solution
- Change Management
- Decision
- Consequences
- ADR Comments Below
Originally recorded during June 28-30, 2022 F2F as "Hints replacements, multiple index schemas (participation of indexer & data definition needs to be in charge), content vs catalog, side-car", then renamed to ADR: User-friendly/App-friendly Index Schemas in Enterprise Architecture ADR #66
Preparation Material
OSDU Data Definitions conducted a number of sessions in the Core Concepts meetings, which contain supplementary information:
2022
- Meeting Minutes 2022-07-05
- Meeting Minutes 2022-07-12
- Meeting Minutes 2022-07-19
- Meeting Minutes 2022-07-26
2023
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
The entity type schemas delivered by the OSDU Data Definitions subcommittee pose a number of challenges
for consumers. Most of them are due to the normalization of schemas and the friendliness to ingestors, which allows
storage of values as is and less standardized. The main problem is the usage of arrays of objects, which are difficult
to handle when forming queries and costly to index. So far the issues have been mitigated by decorating arrays of
objects with x-osdu-indexing instructions. An umbrella issue has been recorded in community DD issue #30, which
collects a number of more detailed requests.
In previous OSDU prototypes, this was addressed by specific workarounds, see OSDU R1 Indexing Approach and Specification.
Here is an attempt to classify the workarounds listed in the R1 document above:
1. Extraction of standardized values from arrays of objects using conditions (e.g., Well UWI, SpudDate).
2. Chasing relationships to parent or related objects in order to de-normalize parent/related object values on children.
3. Offering related object's Name/Code for presentations in applications.
4. Counting children of well-known kinds. (The priority of this is lower compared to 1 and 2. The current Search service should be capable of querying a particular parent-child relationship.)
The current methods using the x-osdu-virtual-properties, x-osdu-is-derived and x-osdu-indexing JSON schema decorations
fall short when the query conditions become dependent on the platform operator's usage of, e.g., reference values. In many
cases the reference value lists shipped by OSDU are incomplete or not documented clearly enough to guide global platform
standards.
Requirements
- We need a configurable way to define rules for property extraction, either from nested arrays of objects or from related objects.
- We need OSDU-provided standard index schema extensions to extend the entity type schemas with extracted values (governance for interoperability).
- We need to open the index schema extensions to applications and services to optimize frequently used query patterns. One of them is the look-up of names or codes of related objects where the source record holds the target record id.
- We need a platform-embedded service, which performs the extractions and de-normalizations on demand (data creation/update events).
- We need platform support to refresh indexes if the indexing schemas change (both for OSDU and application indexing schemas).
Tradeoff Analysis
The original tradeoff analysis was performed and recorded in EA ADR #66. The need for performance required further simplification.
- Replicating derived/de-normalized property values in Storage records was discarded as this would create an enormous stack of versions for each individual record as records would need to be updated if properties derived from parents or children changed.
- Instead, de-normalization could happen exclusively in the indexer, simultaneously exploiting the already indexed values of parent and children records. (Preferred option)
- Using configurable index extension rules was already proposed in EA ADR #66. The proposed additional index schemas with references to configurations were discarded. All required information can be encoded in the configurations themselves. Any index extension schema fragments and documentation can be auto-generated from the configurations.
- Interoperability is achieved by firm governance rules - the configurations are stored and customizable as OPEN governance reference-data. However, additional governance rules have to be provided to keep interoperability guaranteed across deployments and to prevent unwanted interference of index extensions with actual schema properties.
Solution
Index Extension, Data Definition
OSDU Standard index extensions are defined by OSDU Data Definition work-streams with the intent to provide user/application friendly, derived properties. The standard set, together with the OSDU schemas, forms the interoperability foundation. The extensions can contribute to delivering domain-specific APIs according to the Domain Driven Design principles.
The configurations are encoded in OSDU reference-data records, one per major schema version. The proposed type name is IndexPropertyPathConfiguration. The diagram below shows the decomposition into parts.
- One IndexPropertyPathConfiguration record corresponds to one schema kind's major version, i.e., the IndexPropertyPathConfiguration record id for all the schema osdu:wks:master-data--Wellbore:1.*.* kinds is set to partition-id:reference-data--IndexPropertyPathConfiguration:osdu:wks:master-data--Wellbore:1. Code, Name and Descriptions are filled with meaningful data as usual for all reference-data types.
- The additional index properties are added with one JSON object each in the Configurations[] array. The Name defines the name of the index 'column', or the name of the property one can search for. The Policy decides, in the current usage, whether the resulting value is a single value or an array containing the aggregated, derived values.
- Each Configurations[] element has at least one element defined in Paths[].
- The ValueExtraction object has one mandatory property, ValuePath. The other two, optional properties hold value match conditions, i.e., the property containing the value to be matched and the value to match.
- If no RelatedObjectsSpec is present, the value is derived from the object being indexed.
- If RelatedObjectsSpec is provided, the value extraction is carried out in related objects. Depending on the RelationshipDirection, these are either the parent/related objects or the children. The property holding the record id to follow is specified in RelatedObjectID, the expected target kind in RelatedObjectKind. As in ValueExtraction, the selection can be filtered by a match condition (RelatedConditionProperty and RelatedConditionMatches).
With this, the extension properties can be defined as if they were provided by a schema.
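To make the decomposition concrete, the following Python sketch models the parts listed above as dataclasses. The class names (ValueExtractionSpec, RelatedObjects, PathSpec, PropertyConfiguration) and the helper configuration_record_id are hypothetical and chosen for illustration only; the field names are the property names used in this ADR. It is a sketch of the data shape, not the schema shipped by OSDU.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ValueExtractionSpec:
    ValuePath: str  # the only mandatory property
    RelatedConditionProperty: Optional[str] = None
    RelatedConditionMatches: List[str] = field(default_factory=list)


@dataclass
class RelatedObjects:
    RelatedObjectID: str
    RelatedObjectKind: str
    RelationshipDirection: Optional[str] = None  # e.g. "ParentToChildren"
    RelatedConditionProperty: Optional[str] = None
    RelatedConditionMatches: List[str] = field(default_factory=list)


@dataclass
class PathSpec:
    ValueExtraction: ValueExtractionSpec
    # Absent: the value is derived from the record being indexed.
    RelatedObjectsSpec: Optional[RelatedObjects] = None


@dataclass
class PropertyConfiguration:
    Name: str
    Policy: str  # "ExtractFirstMatch" or "ExtractAllMatches"
    Paths: List[PathSpec]

    def __post_init__(self) -> None:
        if not self.Paths:
            raise ValueError("each Configurations[] element needs at least one Paths[] element")


def configuration_record_id(partition_id: str, kind: str) -> str:
    """Build the configuration record id for the major version of a kind."""
    authority, source, entity, version = kind.split(":")
    major = version.split(".")[0]
    return (f"{partition_id}:reference-data--IndexPropertyPathConfiguration:"
            f"{authority}:{source}:{entity}:{major}")


well_uwi = PropertyConfiguration(
    Name="WellUWI",
    Policy="ExtractFirstMatch",
    Paths=[PathSpec(ValueExtraction=ValueExtractionSpec(ValuePath="data.NameAliases[].AliasName"))],
)
record_id = configuration_record_id("partition-id", "osdu:wks:master-data--Wellbore:1.2.0")
```

Every minor and patch version of a kind maps to the same record id, which mirrors the rule of one configuration record per major schema version.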
Most of the use cases deal with text (string) types. The definition of configurations is, however, not limited to string
types. As long as the property is known to the indexer, i.e., the source record schema describes the types, the type
can be inferred by the indexer. This does not work for nested arrays of objects that have not been indexed
with "x-osdu-indexing": {"type":"nested"}. In this case the types unknown to the Indexer Service are
string-serialized; the resulting index type is then string, still supporting text search.
Use Case 1, WellUWI
As a user I want to discover and match Wells by their UWI. I am aware that this is not globally reliable, however, I am able to specify a prioritized AliasNameType list to look up value in the NameAliases array.
The configuration demonstrates extractions from the record being indexed itself. With Policy ExtractFirstMatch, the
first value is extracted for which the condition holds, i.e., for which RelatedConditionProperty is equal to one of
RelatedConditionMatches.
Configuration for Well, extract WellUWI from NameAliases[]
{
  "data": {
    "Configurations": [
      {
        "Name": "WellUWI",
        "Policy": "ExtractFirstMatch",
        "Paths": [
          {
            "ValueExtraction": {
              "RelatedConditionMatches": [
                "{{data-partition-id}}:reference-data--AliasNameType:UniqueIdentifier:",
                "{{data-partition-id}}:reference-data--AliasNameType:RegulatoryName:",
                "{{data-partition-id}}:reference-data--AliasNameType:PreferredName:",
                "{{data-partition-id}}:reference-data--AliasNameType:CommonName:"
              ],
              "RelatedConditionProperty": "data.NameAliases[].AliasNameTypeID",
              "ValuePath": "data.NameAliases[].AliasName"
            }
          }
        ],
        "UseCase": "As a user I want to discover and match Wells by their UWI. I am aware that this is not globally reliable, however, I am able to specify a prioritized AliasNameType list to look up value in the NameAliases array."
      }
    ]
  }
}
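The extraction for this use case can be sketched in Python as follows. This is a minimal illustration, not the Indexer implementation. It assumes that the order of RelatedConditionMatches expresses the priority, and that a match is a prefix match on the reference id (the configured values end with a colon). The function names and the sample record are made up for the example.

```python
from typing import Any, Dict, Optional, Tuple


def split_array_path(path: str) -> Tuple[str, str]:
    """Split 'data.NameAliases[].AliasName' into ('NameAliases', 'AliasName')."""
    if path.startswith("data."):
        path = path[len("data."):]
    array_name, _, key = path.partition("[].")
    return array_name, key


def extract_first_match(record: Dict[str, Any], extraction: Dict[str, Any]) -> Optional[str]:
    """Return the value of the first array element that satisfies the condition."""
    array_name, value_key = split_array_path(extraction["ValuePath"])
    _, condition_key = split_array_path(extraction["RelatedConditionProperty"])
    items = record.get("data", {}).get(array_name, [])
    # Assumption: the order of RelatedConditionMatches expresses priority.
    for wanted in extraction["RelatedConditionMatches"]:
        for item in items:
            # Assumption: a match is a prefix match on the reference id.
            if str(item.get(condition_key, "")).startswith(wanted):
                return item.get(value_key)
    return None


# Made-up sample record, for illustration only.
well = {
    "data": {
        "NameAliases": [
            {"AliasName": "Well A", "AliasNameTypeID": "dp:reference-data--AliasNameType:CommonName:"},
            {"AliasName": "UWI-0001", "AliasNameTypeID": "dp:reference-data--AliasNameType:UniqueIdentifier:"},
        ]
    }
}
extraction = {
    "RelatedConditionMatches": [
        "dp:reference-data--AliasNameType:UniqueIdentifier:",
        "dp:reference-data--AliasNameType:CommonName:",
    ],
    "RelatedConditionProperty": "data.NameAliases[].AliasNameTypeID",
    "ValuePath": "data.NameAliases[].AliasName",
}
well_uwi = extract_first_match(well, extraction)
```

Under these assumptions the UniqueIdentifier alias wins over the CommonName alias even though it appears later in the NameAliases array.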
Use Case 2, CountryNames
As a user I want to find objects by a country name, with the understanding that an object may extend over country boundaries.
This configuration demonstrates the extraction from related index objects - here RelatedObjectKind
being osdu:wks:master-data--GeoPoliticalEntity:1., which are found via RelatedObjectID as
in data.GeoContexts[].GeoPoliticalEntityID. The condition is constrained to be that GeoTypeID is
GeoPoliticalEntityType:Country.
Configuration for Well, extract CountryNames from GeoContexts[]
{
  "data": {
    "Configurations": [
      {
        "Name": "CountryNames",
        "Policy": "ExtractAllMatches",
        "Paths": [
          {
            "RelatedObjectsSpec": {
              "RelatedObjectID": "data.GeoContexts[].GeoPoliticalEntityID",
              "RelatedObjectKind": "osdu:wks:master-data--GeoPoliticalEntity:1.",
              "RelatedConditionMatches": [
                "{{data-partition-id}}:reference-data--GeoPoliticalEntityType:Country:"
              ],
              "RelatedConditionProperty": "data.GeoContexts[].GeoTypeID"
            },
            "ValueExtraction": {
              "ValuePath": "data.GeoPoliticalEntityName"
            }
          }
        ],
        "UseCase": "As a user I want to find objects by a country name, with the understanding that an object may extend over country boundaries."
      }
    ]
  }
}
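A rough Python sketch of this look-up is shown below. It assumes an in-memory dictionary standing in for the already indexed related records, prefix matching for both the condition and the kind, and reference ids with a trailing colon that is stripped before the look-up. The function name, its parameters and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def extract_all_from_related(record: Dict[str, Any],
                             index: Dict[str, Dict[str, Any]],
                             array_name: str,
                             id_key: str,
                             condition_key: str,
                             matches: List[str],
                             related_kind: str,
                             value_key: str) -> List[str]:
    """Collect a value from every related record that passes the condition."""
    values: List[str] = []
    for item in record["data"].get(array_name, []):
        # Filter the array elements by the match condition (prefix match assumed).
        if not any(str(item.get(condition_key, "")).startswith(m) for m in matches):
            continue
        # Follow the record id to the related, already indexed record.
        related = index.get(str(item.get(id_key, "")).rstrip(":"))
        if related is None or not related["kind"].startswith(related_kind):
            continue
        value = related["data"].get(value_key)
        if value is not None and value not in values:
            values.append(value)
    return values


# Made-up sample data, for illustration only.
gpe_kind = "osdu:wks:master-data--GeoPoliticalEntity:1.0.0"
index = {
    "dp:master-data--GeoPoliticalEntity:Norway": {"kind": gpe_kind, "data": {"GeoPoliticalEntityName": "Norway"}},
    "dp:master-data--GeoPoliticalEntity:UK": {"kind": gpe_kind, "data": {"GeoPoliticalEntityName": "United Kingdom"}},
    "dp:master-data--GeoPoliticalEntity:Rogaland": {"kind": gpe_kind, "data": {"GeoPoliticalEntityName": "Rogaland"}},
}
well = {"data": {"GeoContexts": [
    {"GeoPoliticalEntityID": "dp:master-data--GeoPoliticalEntity:Norway:",
     "GeoTypeID": "dp:reference-data--GeoPoliticalEntityType:Country:"},
    {"GeoPoliticalEntityID": "dp:master-data--GeoPoliticalEntity:Rogaland:",
     "GeoTypeID": "dp:reference-data--GeoPoliticalEntityType:State:"},
    {"GeoPoliticalEntityID": "dp:master-data--GeoPoliticalEntity:UK:",
     "GeoTypeID": "dp:reference-data--GeoPoliticalEntityType:Country:"},
]}}
country_names = extract_all_from_related(
    well, index, "GeoContexts", "GeoPoliticalEntityID", "GeoTypeID",
    ["dp:reference-data--GeoPoliticalEntityType:Country:"],
    "osdu:wks:master-data--GeoPoliticalEntity:1.", "GeoPoliticalEntityName")
```

With Policy ExtractAllMatches the result is an array, so a well spanning two countries is found under either country name.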
Use Case 3, Wellbore Name on WellLog Children
As a user I want to discover WellLog instances by the wellbore's name value.
A variant of this can be WellUWI from parent Wellbore → Well; in that case the value would be derived from the already extended index values.
This configuration demonstrates extractions from multiple Paths[].
Configuration for WellLog, extract WellboreName from parent WellboreID
{
  "data": {
    "Configurations": [
      {
        "Name": "WellboreName",
        "Policy": "ExtractFirstMatch",
        "Paths": [
          {
            "RelatedObjectsSpec": {
              "RelatedObjectKind": "osdu:wks:master-data--Wellbore:1.",
              "RelatedObjectID": "data.WellboreID"
            },
            "ValueExtraction": {
              "ValuePath": "data.VirtualProperties.DefaultName"
            }
          },
          {
            "RelatedObjectsSpec": {
              "RelatedObjectKind": "osdu:wks:master-data--Wellbore:1.",
              "RelatedObjectID": "data.WellboreID"
            },
            "ValueExtraction": {
              "ValuePath": "data.FacilityName"
            }
          }
        ],
        "UseCase": "As a user I want to discover WellLog instances by the wellbore's name value."
      }
    ]
  }
}
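The effect of several Paths[] elements combined with ExtractFirstMatch can be sketched as a fallback chain. The sketch assumes that the paths are tried in the order in which they are listed and that the first path yielding a non-empty value wins; this ordering is an assumption, and the helper names are hypothetical.

```python
from typing import Any, Dict, List, Optional


def get_by_path(document: Dict[str, Any], path: str) -> Optional[Any]:
    """Resolve a dotted path such as 'data.VirtualProperties.DefaultName'."""
    current: Any = document
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def first_available(related: Dict[str, Any], value_paths: List[str]) -> Optional[Any]:
    """Return the value of the first path that resolves to a non-empty value."""
    for path in value_paths:
        value = get_by_path(related, path)
        if value is not None and value != "":
            return value
    return None


paths = ["data.VirtualProperties.DefaultName", "data.FacilityName"]

# Made-up wellbore records, for illustration only.
with_default_name = {"data": {"VirtualProperties": {"DefaultName": "WB-1 (default)"}, "FacilityName": "WB-1"}}
without_default_name = {"data": {"FacilityName": "WB-2"}}

name_a = first_available(with_default_name, paths)
name_b = first_available(without_default_name, paths)
```

The second record has no VirtualProperties.DefaultName, so the sketch falls back to FacilityName.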
Use Case 4, Wellbore index WellLogCurveMnemonics
As a user I want to find Wellbores by well log mnemonics.
This configuration demonstrates the Policy ExtractAllMatches with related objects discovered by
RelationshipDirection ParentToChildren, i.e., related objects referring to the indexed record.
Configuration for Wellbore, extract WellLogCurveMnemonics from child WellLog records
{
  "data": {
    "Configurations": [
      {
        "Name": "WellLogCurveMnemonics",
        "Policy": "ExtractAllMatches",
        "Paths": [
          {
            "RelatedObjectsSpec": {
              "RelationshipDirection": "ParentToChildren",
              "RelatedObjectID": "WellboreID",
              "RelatedObjectKind": "osdu:wks:work-product-component--WellLog:1."
            },
            "ValueExtraction": {
              "ValuePath": "Curves[].Mnemonic"
            }
          }
        ],
        "UseCase": "As a user I want to find Wellbores by well log mnemonics."
      }
    ]
  }
}
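For the ParentToChildren direction the look-up is reversed: the children hold the id of the record being indexed. The Python sketch below illustrates this with a plain list standing in for a search over the child index. The de-duplication of values and the stripping of the trailing colon from reference ids are assumptions; the function name and the sample data are hypothetical.

```python
from typing import Any, Dict, List


def collect_from_children(parent_id: str,
                          children: List[Dict[str, Any]],
                          related_kind: str,
                          id_key: str,
                          array_name: str,
                          value_key: str) -> List[str]:
    """Aggregate values from all children that refer to the given parent."""
    values: List[str] = []
    for child in children:
        if not child["kind"].startswith(related_kind):
            continue
        # The child refers to the parent through the RelatedObjectID property.
        if str(child["data"].get(id_key, "")).rstrip(":") != parent_id:
            continue
        for item in child["data"].get(array_name, []):
            value = item.get(value_key)
            if value is not None and value not in values:  # de-duplication is an assumption
                values.append(value)
    return values


# Made-up WellLog records, for illustration only.
log_kind = "osdu:wks:work-product-component--WellLog:1.1.0"
well_logs = [
    {"kind": log_kind, "data": {"WellboreID": "dp:master-data--Wellbore:WB-1:",
                                "Curves": [{"Mnemonic": "GR"}, {"Mnemonic": "DT"}]}},
    {"kind": log_kind, "data": {"WellboreID": "dp:master-data--Wellbore:WB-1:",
                                "Curves": [{"Mnemonic": "GR"}, {"Mnemonic": "RHOB"}]}},
    {"kind": log_kind, "data": {"WellboreID": "dp:master-data--Wellbore:WB-2:",
                                "Curves": [{"Mnemonic": "NPHI"}]}},
]
mnemonics = collect_from_children(
    "dp:master-data--Wellbore:WB-1", well_logs,
    "osdu:wks:work-product-component--WellLog:1.", "WellboreID", "Curves", "Mnemonic")
```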
Index Extension, Governance
OSDU Data Definition ships reference value list content for all reference-data group-type entities. The type IndexPropertyPathConfiguration is classified as OPEN governance, which usually means that new records can be added by platform operators. This rule must be adjusted for IndexPropertyPathConfiguration records.
Permitted Changes to IndexPropertyPathConfiguration Records
It is permitted to
- customize the conditions for value extractions, notably the matching values in RelatedConditionMatches.
- add additional Paths[] elements to Configurations[].Paths[].
- add new index property configuration objects to the Configurations[] array. To avoid interference with future OSDU updates it is strongly recommended to add a namespace prefix to the Configurations[].Name, e.g., "OperatorX.WellUWI".
Prohibited Changes to IndexPropertyPathConfiguration Records
It is not permitted to
- change the target value type of existing, OSDU shipped index extensions. Example: the ExtractionPath to a string property in the original OSDU Configurations[].ValueExtraction.ValuePath must not be altered to a number, integer, or array.
- change the meaning of existing, OSDU shipped index extensions.
- remove OSDU shipped extension definitions in Configurations[].
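Part of these rules can be checked mechanically. The following Python sketch compares a customized Configurations[] array with the OSDU shipped one and reports removed extensions as errors and un-prefixed operator extensions as warnings. The function name is hypothetical, and treating a dot in the Name as the namespace separator is an assumption derived from the "OperatorX.WellUWI" example. Changes of value type or meaning are not covered; they would require schema knowledge or human review.

```python
from typing import Any, Dict, List, Tuple


def check_customization(shipped: List[Dict[str, Any]],
                        customized: List[Dict[str, Any]]) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) as lists of offending Configurations[].Name values."""
    shipped_names = [c["Name"] for c in shipped]
    custom_names = [c["Name"] for c in customized]
    # Prohibited: removing OSDU shipped extension definitions.
    errors = [name for name in shipped_names if name not in custom_names]
    # Recommended: operator extensions carry a namespace prefix.
    warnings = [name for name in custom_names
                if name not in shipped_names and "." not in name]
    return errors, warnings


shipped = [{"Name": "WellUWI"}, {"Name": "CountryNames"}]
customized = [{"Name": "WellUWI"}, {"Name": "OperatorX.WellUWI"}, {"Name": "FieldName"}]
errors, warnings = check_customization(shipped, customized)
```

In this made-up example CountryNames has been removed (an error) and FieldName lacks a namespace prefix (a warning).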
Consumption by Indexer Service
Recursive Index Updates
With the introduction of de-normalizations record updates can cause infinite recursions. The implementation needs to address this and avoid situations like in the following diagram:
On the left-hand side, Storage records are updated to new versions, which triggers indexing. The update of the index
triggers the index update of related index records due to the derived property values (as defined in the
RelatedObjectsSpec). These updates may, in turn, cause a recursion. This must not happen.
The augmenter introduces a new attribute ancestry_kinds in the Attributes map of the message payload when sending
messages to update the index of parent/children records. The value of the ancestry_kinds attribute can include multiple
kinds separated by commas. This new attribute is used to prevent infinite loops in the index chasing. The indexer-queue
must pass the attribute back to the indexer when it receives indexing messages.
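The text above does not spell out how ancestry_kinds is evaluated. The Python sketch below shows one plausible reading, stated as an assumption: before a follow-up message is sent to records of a target kind, the kind of the record just indexed is appended to the ancestry, and the message is suppressed if the target kind already appears in it. The function name and the message shape are hypothetical.

```python
from typing import Dict, Optional


def next_message_attributes(attributes: Dict[str, str],
                            current_kind: str,
                            target_kind: str) -> Optional[Dict[str, str]]:
    """Return the attributes for the follow-up message, or None to stop the chase."""
    ancestry = [k for k in attributes.get("ancestry_kinds", "").split(",") if k]
    ancestry.append(current_kind)
    # Assumption: a kind that was already visited in this chain is not updated again.
    if target_kind in ancestry:
        return None
    return {**attributes, "ancestry_kinds": ",".join(ancestry)}


well = "osdu:wks:master-data--Well:1.0.0"
wellbore = "osdu:wks:master-data--Wellbore:1.0.0"
well_log = "osdu:wks:work-product-component--WellLog:1.0.0"

first = next_message_attributes({}, well, wellbore)         # Well update chases Wellbores
second = next_message_attributes(first, wellbore, well_log)  # Wellbore update chases WellLogs
third = next_message_attributes(second, well_log, well)      # would loop back to Well
```

In this sketch the third hop returns None, which breaks the cycle Well, Wellbore, WellLog, Well.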
Pseudo-Code
- For each record to be indexed (create/update event from Storage service):
  - Does the record kind have an IndexPropertyPathConfiguration?
    - Yes
      - get or create the internal index schema that combines the schema of the record kind and the schema of the extended properties
      - create the index document that combines the properties of the original record and the extended properties
      - call ElasticSearch service to create or update the index of the record with extended properties
    - No
      - No action (=default for records without IndexPropertyPathConfiguration)
- Re-Indexing (create/update event from Storage service for an IndexPropertyPathConfiguration record). To update the schema (or say template) of the kind in ElasticSearch when the kind is re-indexed:
  - create the internal index schema derived from the kind (as registered in the Schema service)
  - create the internal index schema derived from IndexPropertyPathConfiguration
  - merge the internal index schemas
  - convert the schema to an ElasticSearch template
  - call ElasticSearch service to update the index template (schema)
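The merge step of the re-indexing flow can be sketched in Python as follows. The representation of the internal index schema (a dictionary from property name to a small type descriptor) is purely hypothetical and is not the ElasticSearch template format. Rejecting name collisions and falling back to string for unknown types are assumptions in line with the governance and type inference notes in this ADR.

```python
from typing import Any, Dict, List


def merge_index_schemas(kind_schema: Dict[str, Dict[str, Any]],
                        configurations: List[Dict[str, Any]],
                        inferred_types: Dict[str, str]) -> Dict[str, Dict[str, Any]]:
    """Combine the schema of the kind with the extended properties."""
    merged = dict(kind_schema)
    for cfg in configurations:
        name = cfg["Name"]
        if name in merged:
            # Assumption: extensions must not shadow actual schema properties.
            raise ValueError(f"extension {name!r} collides with a schema property")
        merged[name] = {
            "type": inferred_types.get(name, "string"),        # string fallback
            "array": cfg["Policy"] == "ExtractAllMatches",      # aggregated values
        }
    return merged


kind_schema = {"FacilityName": {"type": "string", "array": False}}
configurations = [
    {"Name": "WellUWI", "Policy": "ExtractFirstMatch"},
    {"Name": "CountryNames", "Policy": "ExtractAllMatches"},
]
merged = merge_index_schemas(kind_schema, configurations, inferred_types={})
```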
Accepted Limitations
- A change in the configurations requires re-indexing of all the records of a major schema version kind. It is the same limitation as an in-place schema change for any kind.
- All the extensions defined in the IndexPropertyPathConfiguration records refer to properties in the data block, including ValuePath, RelatedObjectID, RelatedConditionProperty.
- Only properties in the data block of records being indexed can be reached by the ValuePath; system properties are out of reach. The prefix data. is therefore optional and can be omitted.
- The formats/values of the extended properties are extracted from the formats/values of the related index records. If the formats of the original properties are unknown in the related index records, the indexer will set the value type of the extended properties as string or string array. (With additional complexity and schema parsing, this limitation can be overcome, but currently the added value seems to be marginal.)
- If the extended properties are extracted from arrays of objects indexed with "x-osdu-indexing": {"type":"flattened"}, the indexer cannot re-construct the object properties to the nested objects when the policy ExtractAllMatches is applied. (The kind of indexing is already a deliberate choice. With additional complexity, this limitation can be overcome, but currently the added value seems to be marginal.)
- To simplify the solution, all the related kinds defined in the configuration are kinds with major version only. They must end with a dot ".". For example: "RelatedObjectKind": "osdu:wks:work-product-component--WellLog:1.".
- Index updates may take time. Immediate consistency cannot be expected.
- When a kind derives extended properties from its parent(s), a new data property data.AssociatedIdentities is added on demand by the indexer. The property name AssociatedIdentities is therefore reserved by the Indexer and shall not be used in any OSDU schemas. Currently, the property name AssociatedIdentities is not in use in any of the OSDU well-known schemas. Tests will be implemented in the OSDU Data Definition pipeline to ensure that this reserved name does not appear as a property in the data block.
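Two of the limitations above lend themselves to a small Python illustration: the optional data. prefix and the major-version-only kinds ending with a dot. The sketch assumes that kinds are compared by simple prefix matching, in which case the trailing dot keeps major version 1 from also matching major version 10. Both helper names are hypothetical.

```python
def normalize_path(path: str) -> str:
    """Add the optional 'data.' prefix so that both spellings compare equal."""
    return path if path.startswith("data.") else "data." + path


def kind_matches(record_kind: str, related_object_kind: str) -> bool:
    """Check a concrete kind against a major-version-only kind ending with a dot."""
    if not related_object_kind.endswith("."):
        raise ValueError("RelatedObjectKind must end with a dot, e.g. '...WellLog:1.'")
    return record_kind.startswith(related_object_kind)


related_kind = "osdu:wks:work-product-component--WellLog:1."
same_major = kind_matches("osdu:wks:work-product-component--WellLog:1.2.0", related_kind)
other_major = kind_matches("osdu:wks:work-product-component--WellLog:10.0.0", related_kind)
full_path = normalize_path("Curves[].Mnemonic")
```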
Change Management
- Configurations are reference-data and need to be ingested/updated.
- OSDU Data Definitions must take on the task of defining IndexPropertyPathConfiguration records.
- Updates (extensions) of index extensions must be managed carefully as they cause re-indexing of the kinds involved.
Decision
Consequences
- The indexer code changes should have no impact on the system if no IndexPropertyPathConfiguration records are present.