Indexer issues
https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues
2024-02-14T18:00:03Z

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/81
ADR: Configurable Index Extensions and De-Normalizations
2024-02-14T18:00:03Z
Thomas Gehrmann [slb]

<a name="TOC"></a>
[[_TOC_]]
Originally recorded during June 28-30, 2022 F2F as "Hints replacements, multiple index schemas (participation of indexer
& data definition needs to be in charge), content vs catalog, side-car", then renamed to ADR: User-friendly/App-friendly
Index Schemas
in [Enterprise Architecture ADR #66](https://gitlab.opengroup.org/osdu/subcommittees/ea/work-products/adr-elaboration/-/issues/66)
<details>
<summary markdown="span">Preparation Material</summary>
OSDU Data Definitions conducted a number of sessions in the Core Concepts meetings, which contain supplementary
information:
**2022**
1. [Meeting Minutes 2022-07-05](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/core-concepts/docs/-/blob/master/Meeting%20Minutes/2022/2022-07-05-DataDefinitionsCoreConcepts_MeetingMinutes.md#42-user-friendly-schemas-de-normalizations)
2. [Meeting Minutes 2022-07-12](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/core-concepts/docs/-/blob/master/Meeting%20Minutes/2022/2022-07-12-DataDefinitionsCoreConcepts_MeetingMinutes.md#43-user-friendly-schemas-aka-index-schemas)
3. [Meeting Minutes 2022-07-19](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/core-concepts/docs/-/blob/master/Meeting%20Minutes/2022/2022-07-19-DataDefinitionsCoreConcepts_MeetingMinutes.md#43-user-friendly-schemas-aka-index-schemas)
4. [Meeting Minutes 2022-07-26](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/core-concepts/docs/-/blob/master/Meeting%20Minutes/2022/2022-07-26-DataDefinitionsCoreConcepts_MeetingMinutes.md#42-user-friendly-schemas-aka-index-schemas)
**2023**
1. [Meeting Minutes 2023-03-21](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/core-concepts/docs/-/blob/master/Meeting%20Minutes/2023/2023-03-21-DataDefinitionsCoreConcepts_MeetingMinutes.md#42-index-extensions-adr-66-configuration)
2. [Meeting Minutes 2023-03-28](https://gitlab.opengroup.org/osdu/subcommittees/data-def/projects/core-concepts/docs/-/blob/master/Meeting%20Minutes/2023/2023-03-28-DataDefinitionsCoreConcepts_MeetingMinutes.md#42-index-extensions-configuration-mechanics-schema-review)
3. [Enterprise Architecture Advice Forum 2023-04-12](https://opensdu.slack.com/archives/C04TPV9CRUP/p1681291140407219?thread_ts=1681217870.084929&cid=C04TPV9CRUP)
</details>
# Status
- [x] Proposed
- [x] Trialing
- [x] Under review
- [x] Approved
- [ ] Retired
# Context & Scope
The entity type schemas delivered by the OSDU Data Definitions subcommittee pose a number of challenges
for consumers. Most of them stem from the normalization of the schemas and from their friendliness to ingestors, which
allows values to be stored as is and in less standardized form. The main problem is the use of arrays of objects, which
are difficult to query and costly to index. So far, these issues have been mitigated by decorating arrays of objects
with `x-osdu-indexing` instructions. An umbrella issue has been recorded in
[community DD issue #30](https://community.opengroup.org/osdu/data/data-definitions/-/issues/30), which collects a
number of more detailed requests.
In previous OSDU prototypes, this was addressed by specific workarounds,
see [OSDU R1 Indexing Approach and Specification](https://gitlab.opengroup.org/osdu/subcommittees/ea/work-products/adr-elaboration/-/wikis/uploads/46b4f84f0903cc385abd147a0175a00a/r1_indexing.pdf).
Here is an attempt to classify the workarounds listed in the R1 document above:
1. Extraction of standardized values from arrays of objects using conditions (e.g., Well UWI, SpudDate).
2. Chasing relationships to parent or related objects in order to de-normalize parent/related object values on children.
3. Offering related object's Name/Code for presentations in applications.
4. Counting children of well-known kinds. (The priority of this is lower compared to 1 and 2. The current Search service
should be capable of querying a particular parent-child relationship.)
The current methods using `x-osdu-virtual-properties`, `x-osdu-is-derived` and `x-osdu-indexing` JSON schema decorations
fall short when the query conditions depend on the platform operators' usage of, e.g., reference values. In many
cases the reference value lists shipped by OSDU are incomplete or not documented clearly enough to guide global platform
standards.
[Back to TOC](#TOC)
---
## Requirements
* We need a configurable way to define rules for property extraction, either from nested arrays of objects or from
related objects.
* We need OSDU-provided standard index schema extensions to extend the entity type schemas with extracted values
  (governance for interoperability).
* We need to open the index schema extensions to applications and services to optimize frequently used query patterns.
One of them is the look-up of names or codes of related objects where the source record holds the target record id.
* We need a platform-embedded service that performs the extractions and de-normalizations on demand (on data
  creation/update events).
* We need platform support to refresh indexes if the indexing schemas change (both for OSDU and application indexing
  schemas).
[Back to TOC](#TOC)
---
# Tradeoff Analysis
The original tradeoff analysis was performed and recorded
in [EA ADR #66](https://gitlab.opengroup.org/osdu/subcommittees/ea/work-products/adr-elaboration/-/issues/66).
The need for performance required further simplification.
* Replicating derived/de-normalized property values in Storage records was discarded as this would create an enormous
stack of versions for each individual record as records would need to be updated if properties derived from parents or
children changed.
* Instead, de-normalization could happen exclusively in the indexer, simultaneously exploiting the already indexed
values of parent and children records. (Preferred option)
* Using configurable index extension rules was already proposed
in [EA ADR #66](https://gitlab.opengroup.org/osdu/subcommittees/ea/work-products/adr-elaboration/-/issues/66). The
proposed additional index schemas with references to configurations were discarded. All required information can be
encoded in the configurations themselves. Any index extension schema fragments and documentation can be auto-generated
from the configurations.
* Interoperability is achieved by firm governance rules - the configurations are stored and customizable as OPEN
governance reference-data. However, additional governance rules have to be provided to keep interoperability
guaranteed across deployments and to prevent unwanted interference of index extensions with actual schema properties.
[Back to TOC](#TOC)
---
# Solution
## Index Extension, Data Definition
OSDU standard index extensions are defined by OSDU Data Definition work-streams with the intent to provide
user/application-friendly, derived properties. The standard set, together with the OSDU schemas, forms the
interoperability foundation. The extensions can contribute to delivering domain-specific APIs according to the
Domain Driven Design principles.
The configurations are encoded in OSDU reference-data records, one per major schema version. The proposed type name
is IndexPropertyPathConfiguration. The diagram below shows the decomposition into parts.
![IndexPropertyPathConfiguration](/uploads/7f1330dd7a41903a90174feb7fe2c9d9/IndexPropertyPathConfiguration.png)
* One IndexPropertyPathConfiguration record corresponds to one schema kind's major version, i.e., the
  IndexPropertyPathConfiguration record id for all the schema `osdu:wks:master-data--Wellbore:1.*.*` kinds is set
  to `partition-id:reference-data--IndexPropertyPathConfiguration:osdu:wks:master-data--Wellbore:1`. Code, Name and
  Description are filled with meaningful data as usual for all reference-data types.
* The additional index properties are added with one JSON object each in the `Configurations[]` array. The `Name`
  defines the name of the index 'column', i.e., the name of the property one can search for. The `Policy` decides, in
  the current usage, whether the resulting value is a single value or an array containing the aggregated, derived values.
* Each `Configurations[]` element has at least one element defined in `Paths[]`.
* The `ValueExtraction` object has one mandatory property, `ValuePath`. The two optional properties hold a value
  match condition, i.e., the property containing the value to be matched (`RelatedConditionProperty`) and the values to
  match (`RelatedConditionMatches`).
* If no `RelatedObjectsSpec` is present, the value is derived from the object being indexed.
* If `RelatedObjectsSpec` is provided, the value extraction is carried out in related objects. Depending on
  the `RelationshipDirection`, these are either parent/related objects or children. The property holding the record id
  to follow is specified in `RelatedObjectID`, and the expected target kind in `RelatedObjectKind`. As
  in `ValueExtraction`, the selection can be filtered by a match condition (`RelatedConditionProperty`
  and `RelatedConditionMatches`).
With this, the extension properties can be defined as if they were provided by a schema.
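The record id convention described above can be sketched in a few lines of Python. This is only an illustration: the helper name `config_record_id` is hypothetical and not part of any OSDU library, and the kind string is assumed to always have the four colon-separated parts shown in the example.

```python
def config_record_id(partition_id: str, kind: str) -> str:
    """Build the IndexPropertyPathConfiguration record id for a schema kind.

    Assumes a kind of the form authority:source:entity-type:major.minor.patch.
    """
    authority, source, entity_type, version = kind.split(":")
    major = version.split(".")[0]  # only the major version is part of the id
    return (
        f"{partition_id}:reference-data--IndexPropertyPathConfiguration:"
        f"{authority}:{source}:{entity_type}:{major}"
    )


print(config_record_id("partition-id", "osdu:wks:master-data--Wellbore:1.2.0"))
```

All minor and patch versions of the same major version map to the same configuration record id.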
Most of the use cases deal with text (string) types. The definition of configurations is, however, not limited to string
types. As long as the property is known to the indexer, i.e., the source record schema describes the types, the type
can be inferred by the indexer. This does not work for nested arrays of objects that have not been indexed
with `"x-osdu-indexing": {"type":"nested"}`. In this case the types unknown to the Indexer Service are
string-serialized; the resulting index type is then `string`, which still supports text search.
[Back to TOC](#TOC)
---
### Use Case 1, WellUWI
_As a user I want to discover and match Wells by their UWI. I am aware that this is not globally reliable, however, I am
able to specify a prioritized AliasNameType list to look up value in the NameAliases array._
The configuration demonstrates extractions from the record being indexed itself. With Policy `ExtractFirstMatch`, the
first value for which `RelatedConditionProperty` is equal to one of the `RelatedConditionMatches` is extracted.
<details><summary>Configuration for Well, extract WellUWI from NameAliases[]</summary>
```json
{
  "data": {
    "Configurations": [
      {
        "Name": "WellUWI",
        "Policy": "ExtractFirstMatch",
        "Paths": [
          {
            "ValueExtraction": {
              "RelatedConditionMatches": [
                "{{data-partition-id}}:reference-data--AliasNameType:UniqueIdentifier:",
                "{{data-partition-id}}:reference-data--AliasNameType:RegulatoryName:",
                "{{data-partition-id}}:reference-data--AliasNameType:PreferredName:",
                "{{data-partition-id}}:reference-data--AliasNameType:CommonName:"
              ],
              "RelatedConditionProperty": "data.NameAliases[].AliasNameTypeID",
              "ValuePath": "data.NameAliases[].AliasName"
            }
          }
        ],
        "UseCase": "As a user I want to discover and match Wells by their UWI. I am aware that this is not globally reliable, however, I am able to specify a prioritized AliasNameType list to look up value in the NameAliases array."
      }
    ]
  }
}
```
</details>
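To illustrate how such a configuration might be evaluated, here is a minimal Python sketch of an `ExtractFirstMatch` lookup over `NameAliases[]`. It rests on two assumptions that the text does not confirm: the order of `RelatedConditionMatches` is treated as the priority order, and a condition value matches when the `AliasNameTypeID` starts with it. The function name, the `opendes` partition id and the alias values are hypothetical.

```python
from typing import Any, Optional


def extract_first_match(items: list, condition_key: str, matches: list, value_key: str) -> Optional[Any]:
    """Return the value of the first array element satisfying the match condition.

    Assumption: match values are tried in their listed order (priority list),
    and an element matches if its condition value starts with the match value.
    """
    for prefix in matches:
        for item in items:
            type_id = str(item.get(condition_key, ""))
            if type_id.startswith(prefix) and value_key in item:
                return item[value_key]
    return None  # no alias of any listed type present


# Hypothetical NameAliases[] content of a Well record
name_aliases = [
    {"AliasName": "Lucky Strike 1", "AliasNameTypeID": "opendes:reference-data--AliasNameType:CommonName:"},
    {"AliasName": "REG-0042", "AliasNameTypeID": "opendes:reference-data--AliasNameType:RegulatoryName:"},
]
priorities = [
    "opendes:reference-data--AliasNameType:UniqueIdentifier:",
    "opendes:reference-data--AliasNameType:RegulatoryName:",
    "opendes:reference-data--AliasNameType:PreferredName:",
    "opendes:reference-data--AliasNameType:CommonName:",
]

well_uwi = extract_first_match(name_aliases, "AliasNameTypeID", priorities, "AliasName")
print(well_uwi)
```

In this sample no `UniqueIdentifier` alias exists, so the `RegulatoryName` alias is picked ahead of the `CommonName` alias.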
[Back to TOC](#TOC)
---
### Use Case 2, CountryNames
_As a user I want to find objects by a country name, with the understanding that an object may extend over country
boundaries._
This configuration demonstrates the extraction from related index objects. Here the `RelatedObjectKind`
is `osdu:wks:master-data--GeoPoliticalEntity:1.`, and the related objects are found via the `RelatedObjectID`
given as `data.GeoContexts[].GeoPoliticalEntityID`. The condition requires that the `GeoTypeID` is
`GeoPoliticalEntityType:Country`.
<details><summary>Configuration for Well, extract CountryNames from GeoContexts[]</summary>
```json
{
  "data": {
    "Configurations": [
      {
        "Name": "CountryNames",
        "Policy": "ExtractAllMatches",
        "Paths": [
          {
            "RelatedObjectsSpec": {
              "RelatedObjectID": "data.GeoContexts[].GeoPoliticalEntityID",
              "RelatedObjectKind": "osdu:wks:master-data--GeoPoliticalEntity:1.",
              "RelatedConditionMatches": [
                "{{data-partition-id}}:reference-data--GeoPoliticalEntityType:Country:"
              ],
              "RelatedConditionProperty": "data.GeoContexts[].GeoTypeID"
            },
            "ValueExtraction": {
              "ValuePath": "data.GeoPoliticalEntityName"
            }
          }
        ],
        "UseCase": "As a user I want to find objects by a country name, with the understanding that an object may extend over country boundaries."
      }
    ]
  }
}
```
</details>
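The following Python sketch shows one way the `ExtractAllMatches` policy could be applied to this configuration. It is an illustration under stated assumptions, not the Indexer implementation: related records are looked up in an in-memory dictionary instead of the index, duplicate names are dropped, and all ids, names and the function name are hypothetical.

```python
def extract_country_names(record: dict, related_index: dict, country_type_prefix: str) -> list:
    """Collect GeoPoliticalEntityName of all related entities typed as Country."""
    names = []
    for context in record["data"].get("GeoContexts", []):
        # Match condition on the record being indexed (RelatedConditionProperty)
        if not str(context.get("GeoTypeID", "")).startswith(country_type_prefix):
            continue
        # Follow the relationship (RelatedObjectID) to the related record
        related = related_index.get(context.get("GeoPoliticalEntityID"))
        if related is None:
            continue
        # Extract the value from the related record (ValuePath)
        name = related.get("data", {}).get("GeoPoliticalEntityName")
        if name is not None and name not in names:  # assumption: de-duplicate
            names.append(name)
    return names


COUNTRY = "opendes:reference-data--GeoPoliticalEntityType:Country:"
REGION = "opendes:reference-data--GeoPoliticalEntityType:Region:"
well = {"data": {"GeoContexts": [
    {"GeoPoliticalEntityID": "opendes:master-data--GeoPoliticalEntity:Norway:", "GeoTypeID": COUNTRY},
    {"GeoPoliticalEntityID": "opendes:master-data--GeoPoliticalEntity:NorthSea:", "GeoTypeID": REGION},
    {"GeoPoliticalEntityID": "opendes:master-data--GeoPoliticalEntity:UnitedKingdom:", "GeoTypeID": COUNTRY},
]}}
related_index = {
    "opendes:master-data--GeoPoliticalEntity:Norway:": {"data": {"GeoPoliticalEntityName": "Norway"}},
    "opendes:master-data--GeoPoliticalEntity:NorthSea:": {"data": {"GeoPoliticalEntityName": "North Sea"}},
    "opendes:master-data--GeoPoliticalEntity:UnitedKingdom:": {"data": {"GeoPoliticalEntityName": "United Kingdom"}},
}

print(extract_country_names(well, related_index, COUNTRY))
```

The region context is skipped by the match condition, so only the two country names end up in the derived array.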
[Back to TOC](#TOC)
---
### Use Case 3, Wellbore Name on WellLog Children
_As a user I want to discover WellLog instances by the wellbore's name value._
A variant of this can be WellUWI from parent Wellbore → Well; in that case the value would be derived from the
already extended index values.
This configuration demonstrates extractions from multiple `Paths[]`.
<details><summary>Configuration for WellLog, extract WellboreName from parent WellboreID</summary>
```json
{
  "data": {
    "Configurations": [
      {
        "Name": "WellboreName",
        "Policy": "ExtractFirstMatch",
        "Paths": [
          {
            "RelatedObjectsSpec": {
              "RelatedObjectKind": "osdu:wks:master-data--Wellbore:1.",
              "RelatedObjectID": "data.WellboreID"
            },
            "ValueExtraction": {
              "ValuePath": "data.VirtualProperties.DefaultName"
            }
          },
          {
            "RelatedObjectsSpec": {
              "RelatedObjectKind": "osdu:wks:master-data--Wellbore:1.",
              "RelatedObjectID": "data.WellboreID"
            },
            "ValueExtraction": {
              "ValuePath": "data.FacilityName"
            }
          }
        ],
        "UseCase": "As a user I want to discover WellLog instances by the wellbore's name value."
      }
    ]
  }
}
```
</details>
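A possible reading of multiple `Paths[]` under `ExtractFirstMatch` is a fallback chain: the paths are tried in order and the first one that yields a value wins. The Python sketch below illustrates that reading only; the ordering semantics, the helper names and the sample wellbore records are assumptions.

```python
def get_by_path(record: dict, dotted_path: str):
    """Resolve a simple dotted path (no array segments) or return None."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def first_match_over_paths(related_record: dict, value_paths: list):
    """Try each ValuePath in order and return the first non-empty value."""
    for path in value_paths:
        value = get_by_path(related_record, path)
        if value not in (None, ""):
            return value
    return None


value_paths = ["data.VirtualProperties.DefaultName", "data.FacilityName"]

# Hypothetical parent Wellbore records of a WellLog
wellbore_with_default = {"data": {"VirtualProperties": {"DefaultName": "WB-1 Main"}, "FacilityName": "WB-1"}}
wellbore_without_default = {"data": {"FacilityName": "WB-2"}}

print(first_match_over_paths(wellbore_with_default, value_paths))
print(first_match_over_paths(wellbore_without_default, value_paths))
```

The second wellbore has no `VirtualProperties.DefaultName`, so the sketch falls back to `FacilityName`.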
[Back to TOC](#TOC)
---
### Use Case 4, Wellbore index WellLogCurveMnemonics
_As a user I want to find Wellbores by well log mnemonics._
This configuration demonstrates the Policy `ExtractAllMatches` with related objects discovered by
RelationshipDirection `ParentToChildren`, i.e., related objects referring to the indexed record.
<details><summary>Configuration for Wellbore, extract WellLogCurveMnemonics from child WellLogs</summary>
```json
{
  "data": {
    "Configurations": [
      {
        "Name": "WellLogCurveMnemonics",
        "Policy": "ExtractAllMatches",
        "Paths": [
          {
            "RelatedObjectsSpec": {
              "RelationshipDirection": "ParentToChildren",
              "RelatedObjectID": "WellboreID",
              "RelatedObjectKind": "osdu:wks:work-product-component--WellLog:1."
            },
            "ValueExtraction": {
              "ValuePath": "Curves[].Mnemonic"
            }
          }
        ],
        "UseCase": "As a user I want to find Wellbores by well log mnemonics."
      }
    ]
  }
}
```
</details>
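The `ParentToChildren` direction can be pictured as an aggregation over all child records that point back to the record being indexed. The sketch below is a simplified illustration: children are held in a plain list rather than queried from the index, duplicates are dropped by assumption, and the function name, record ids and mnemonics are hypothetical.

```python
def collect_child_values(parent_id: str, children: list, link_property: str,
                         list_property: str, item_key: str) -> list:
    """Aggregate values from all children whose link property refers to the parent."""
    values = []
    for child in children:
        data = child.get("data", {})
        if data.get(link_property) != parent_id:  # child belongs to another parent
            continue
        for item in data.get(list_property, []):
            value = item.get(item_key)
            if value is not None and value not in values:  # assumption: de-duplicate
                values.append(value)
    return values


# Hypothetical WellLog records referring to two wellbores
well_logs = [
    {"data": {"WellboreID": "opendes:master-data--Wellbore:wb1:",
              "Curves": [{"Mnemonic": "GR"}, {"Mnemonic": "RHOB"}]}},
    {"data": {"WellboreID": "opendes:master-data--Wellbore:wb1:",
              "Curves": [{"Mnemonic": "GR"}, {"Mnemonic": "NPHI"}]}},
    {"data": {"WellboreID": "opendes:master-data--Wellbore:wb2:",
              "Curves": [{"Mnemonic": "DT"}]}},
]

mnemonics = collect_child_values(
    "opendes:master-data--Wellbore:wb1:", well_logs, "WellboreID", "Curves", "Mnemonic"
)
print(mnemonics)
```

The resulting array would be stored as the `WellLogCurveMnemonics` index property of the first wellbore.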
[Back to TOC](#TOC)
---
## Index Extension, Governance
OSDU Data Definition ships reference value list content for all reference-data group-type entities. The type
IndexPropertyPathConfiguration is classified as OPEN governance, which usually means that new records can be added by
platform operators. This rule must be adjusted for IndexPropertyPathConfiguration records.
### Permitted Changes to IndexPropertyPathConfiguration Records
It is permitted to
* customize the conditions for value extractions, notably the matching values in `RelatedConditionMatches`.
* add additional `Paths[]` elements to `Configurations[].Paths[]`.
* add new index property configuration objects to the `Configurations[]` array. To avoid interference with future OSDU
  updates it is strongly recommended to add a namespace prefix to the `Configurations[].Name`, e.g., "OperatorX.WellUWI".
### Prohibited Changes to IndexPropertyPathConfiguration Records
It is not permitted to
* change the target value type of existing, OSDU-shipped index extensions. Example: if the `ValuePath` in an original
  OSDU `Configurations[].Paths[].ValueExtraction` points to a string property, it must not be altered to point to a
  number, integer, or array.
* change the meaning of existing, OSDU-shipped index extensions.
* remove OSDU-shipped extension definitions in `Configurations[]`.
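Some of these rules lend themselves to an automated check. The Python sketch below compares an operator-customized record with the OSDU-shipped one and reports removed OSDU extensions and operator-added names without a namespace prefix. It is a hypothetical helper, not an existing OSDU tool; it does not cover type or meaning changes, and treating a dot in the name as the namespace separator is an assumption based on the "OperatorX.WellUWI" example.

```python
def check_customization(shipped: dict, customized: dict) -> list:
    """Return a list of governance findings for a customized configuration record."""
    problems = []
    shipped_names = {c["Name"] for c in shipped["data"]["Configurations"]}
    custom_names = [c["Name"] for c in customized["data"]["Configurations"]]

    # Prohibited: removing OSDU-shipped extension definitions
    for name in sorted(shipped_names - set(custom_names)):
        problems.append(f"removed OSDU extension: {name}")

    # Recommended: operator-added extensions carry a namespace prefix
    for name in custom_names:
        if name not in shipped_names and "." not in name:
            problems.append(f"missing namespace prefix: {name}")
    return problems


shipped = {"data": {"Configurations": [{"Name": "WellUWI"}, {"Name": "CountryNames"}]}}
customized = {"data": {"Configurations": [
    {"Name": "WellUWI"},
    {"Name": "OperatorX.FieldName"},
    {"Name": "RigName"},
]}}

for finding in check_customization(shipped, customized):
    print(finding)
```

A check of this kind could run before a customized record is ingested, so that violations surface before any re-indexing is triggered.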
[Back to TOC](#TOC)
---
## Consumption by Indexer Service
### Recursive Index Updates
With the introduction of de-normalizations, record updates can cause infinite recursions. The implementation needs to
address this and avoid situations like the one in the following diagram:
![Recursions](/uploads/020675583cb7b65560f0d73ffe08fc3c/Recursions.png)
On the left-hand side, Storage records are updated to new versions, which triggers indexing. The update of the index
triggers the index update of related index records due to the derived property values (as defined in
the `RelatedObjectsSpec`). These updates may, in turn, cause a recursion. This must not happen.
The augmenter introduces a new attribute `ancestry_kinds` in the Attributes map of the message payload when sending
messages to update the index of parent/children records. The value of the `ancestry_kinds` attribute can include
multiple kinds separated by commas. This new attribute is used to prevent an infinite loop of index chasing. The
indexer-queue must pass the attribute back to the indexer when it receives indexing messages.
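One way to picture the role of `ancestry_kinds` is as a visited list carried along the chain of follow-up messages. The Python sketch below stops the chain as soon as a kind reappears. The text only states that the attribute holds comma-separated kinds, so the stopping rule, the function name and the sample kinds are assumptions and may differ from the Indexer's actual logic.

```python
from typing import Optional


def next_attributes(attributes: dict, current_kind: str) -> Optional[dict]:
    """Return the attributes for a follow-up index message, or None to stop chasing.

    Assumption: a kind that already appears in ancestry_kinds must not be
    chased again within the same chain of updates.
    """
    raw = attributes.get("ancestry_kinds", "")
    ancestry = [kind for kind in raw.split(",") if kind]
    if current_kind in ancestry:
        return None  # loop detected: do not send another message
    updated = dict(attributes)
    updated["ancestry_kinds"] = ",".join(ancestry + [current_kind])
    return updated


WELL = "osdu:wks:master-data--Well:1.0.0"
WELLBORE = "osdu:wks:master-data--Wellbore:1.0.0"

step1 = next_attributes({}, WELL)          # Well update triggers Wellbore update
step2 = next_attributes(step1, WELLBORE)   # Wellbore update would trigger Well update
step3 = next_attributes(step2, WELL)       # Well already in the ancestry: stop

print(step1["ancestry_kinds"])
print(step2["ancestry_kinds"])
print(step3)
```

Because the indexer-queue passes the attribute back unchanged, each hop sees the full ancestry of the update that reached it.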
### Pseudo-Code
1. For each record to be indexed (create/update event from Storage service):
   * Does the record kind have an IndexPropertyPathConfiguration?
     * Yes
       * get or create the internal index schema that combines the schema of the record kind and the schema of the
         extended properties
       * create the index document that combines the properties of the original record and the extended properties
       * call the Elasticsearch service to create or update the index of the record with extended properties
     * No
       * **_No action_** (= default for records without IndexPropertyPathConfiguration)
2. Re-indexing (create/update event from Storage service for an IndexPropertyPathConfiguration record)<br>
   To update the schema (or template) of the kind in Elasticsearch when the kind is re-indexed:
   * create the internal index schema derived from the kind (as registered in the Schema service)
   * create the internal index schema derived from IndexPropertyPathConfiguration
   * merge the internal index schemas
   * convert the schema to an Elasticsearch template
   * call the Elasticsearch service to update the index template (schema)
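The merge and conversion steps of the re-indexing flow can be sketched as follows. The internal index schema is simplified here to a flat mapping from property name to index type, and the resulting template is a bare `mappings.properties` dictionary; both simplifications, the function names and the rule of rejecting name collisions are assumptions for illustration only.

```python
def merge_index_schemas(kind_schema: dict, extension_schema: dict) -> dict:
    """Merge the schema derived from the kind with the schema of extended properties."""
    merged = dict(kind_schema)
    for name, index_type in extension_schema.items():
        if name in merged:
            # Assumption: an extension must not shadow an actual schema property
            raise ValueError(f"extension '{name}' collides with a schema property")
        merged[name] = index_type
    return merged


def to_template(merged: dict) -> dict:
    """Convert the merged schema into a simplified mapping-style dictionary."""
    return {"mappings": {"properties": {name: {"type": t} for name, t in merged.items()}}}


kind_schema = {"FacilityName": "text", "SpudDate": "date"}        # from the Schema service
extension_schema = {"WellUWI": "text", "CountryNames": "text"}    # from the configuration record

template = to_template(merge_index_schemas(kind_schema, extension_schema))
print(sorted(template["mappings"]["properties"]))
```

Rejecting collisions at merge time is one possible way to keep index extensions from interfering with actual schema properties.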
[Back to TOC](#TOC)
---
## Accepted Limitations
* A change in the configurations requires re-indexing of all the records of a major schema version kind. It is the same
limitation as an in-place schema change for any kind.
* All the extensions defined in the IndexPropertyPathConfiguration records refer to properties in the `data` block,
including `ValuePath`, `RelatedObjectID`, `RelatedConditionProperty`.
* Only properties in the `data` block of records being indexed can be reached by the `ValuePath`; system properties are
out of reach. The prefix `data.` is therefore optional and can be omitted.
* The formats/values of the extended properties are extracted from the formats/values of the related index records. If
the formats of the original properties are unknown in the related index records, the indexer will set the value type
of the extended properties as string or string array. (With additional complexity and schema parsing, this limitation
can be overcome, but currently the added value seems to be marginal.)
* If the extended properties are extracted from arrays of objects indexed with
  (`"x-osdu-indexing": {"type":"flattened"}`), the indexer cannot reconstruct which property values belong to which
  nested object when the policy `ExtractAllMatches` is applied. (The kind of indexing is already a deliberate choice.
  With additional complexity, this limitation can be overcome, but currently the added value seems to
  be marginal.)
* To simplify the solution, all the related kinds defined in the configuration are kinds with major version only. They
  must end with a dot ".". For example: `"RelatedObjectKind": "osdu:wks:work-product-component--WellLog:1."`.
* Index updates may take time. Immediate consistency cannot be expected.
* When a kind derives extended properties from its parent(s), a new data property `data.AssociatedIdentities` is added
on demand by the indexer. The property name `AssociatedIdentities` is therefore reserved by the Indexer and shall not
be used in any OSDU schemas.
Currently, the property name `AssociatedIdentities` is not in use in any of the OSDU well-known schemas. Tests will be
implemented in the OSDU Data Definition pipeline to ensure that this reserved name does not appear as property in
the `data` block.
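The trailing dot in `RelatedObjectKind` makes a plain prefix comparison sufficient to select all kinds of one major version. A minimal Python sketch, assuming prefix matching is how the major-version kind is applied (the function name is hypothetical):

```python
def kind_matches(related_object_kind: str, record_kind: str) -> bool:
    """Check whether a full record kind belongs to a major-version-only kind."""
    if not related_object_kind.endswith("."):
        raise ValueError("RelatedObjectKind must end with a dot, e.g. '...:1.'")
    return record_kind.startswith(related_object_kind)


major_kind = "osdu:wks:work-product-component--WellLog:1."

print(kind_matches(major_kind, "osdu:wks:work-product-component--WellLog:1.2.0"))
print(kind_matches(major_kind, "osdu:wks:work-product-component--WellLog:2.0.0"))
print(kind_matches(major_kind, "osdu:wks:work-product-component--WellLog:10.0.0"))
```

Without the dot, the prefix `...WellLog:1` would also match a hypothetical major version 10, which is why the terminating dot matters.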
[Back to TOC](#TOC)
---
# Change Management
1. Configurations are reference-data and need to be ingested/updated.
2. OSDU Data Definitions must take on the task of defining IndexPropertyPathConfiguration records.
3. Updates (extensions) of index extensions must be managed carefully as they cause re-indexing the kinds involved.
# Decision
# Consequences
* The indexer code changes should have no impact on the system if no IndexPropertyPathConfiguration records are present.
[Back to TOC](#TOC)
---
# ADR Comments Below
M18 - Release 0.21

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/73
Indexer fails to correctly parse properties with special characters
2022-08-23T15:08:44Z
An Ngo

For example:
```
"SpatialArea": {
  "Wgs84Coordinates": {
    "features": [
      {
        "geometry": {
          "type": "Point",
          "coordinates": [
            2.2863,
            61.198685
          ]
        },
        "properties": {
          "id": "a:b"
        },
        "type": "Feature"
      }
    ],
    "type": "FeatureCollection"
  }
}
```
The Indexer fails to parse the `id` in `properties` when its value contains a colon.

https://community.opengroup.org/osdu/platform/system/indexer-service/-/issues/8
Add schema service endpoint
2021-01-11T21:42:06Z
ethiraj krishnamanaidu

The Indexer core logic has been updated to integrate with the Schema service and merged to master.
Please update the following:
* Add the Schema service endpoint (`SCHEMA_HOST`) in application.properties [Example](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/provider/indexer-azure/src/main/resources/application.properties#L39)
* Update the integration test class path [Example](https://community.opengroup.org/osdu/platform/system/indexer-service/-/blob/master/testing/indexer-test-azure/src/test/java/org/opengroup/osdu/step_definitions/index/record/RunTest.java#L23)

ethiraj krishnamanaidu, Dania Kodeih (Microsoft), Wladmir Frazao, Joe, Dmitriy Rudko, ethiraj krishnamanaidu