ADR: Array of Objects support by Indexer
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context
Currently, the Indexer and Search implementations ignore Array of Objects structures in schemas. One example of such a structure in DD schemas is WellLog.Curves.
You can find historical context in the following issues:
As a result, users are not able to:
- Retrieve such objects using Search (fixed in MR !114 (merged))
- Search against the data in such objects
- Run complex search queries, e.g. range or spatial queries
Elasticsearch has several options and data types to handle such cases. There is no silver bullet and each of these types has pros and cons.
- object type - Queries against individual objects are not supported (the objects in an array are not searchable independently of each other).
- nested type - Has a serious impact on performance, because every object is indexed as a separate document, e.g. 100 WellLog records with 100 Curves each will produce 100 x 100 = 10,000 nested documents in the Elasticsearch index.
- flattened type - Does not support complex queries and is available ONLY with X-Pack (not an open-source type).
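To make the trade-off concrete, here is a minimal sketch of what the three field mappings could look like for WellLog.Curves, together with a rough document-count estimate for the nested case. The sub-field names (`Mnemonic`, `TopDepth`) and the helper `lucene_document_count` are illustrative assumptions, not part of any existing Indexer code. The estimate also counts the parent documents, so it comes out slightly above the 100 x 100 figure quoted above.

```python
# Hypothetical sub-fields of a single Curves element (assumed for illustration).
CURVE_PROPERTIES = {
    "Mnemonic": {"type": "keyword"},
    "TopDepth": {"type": "double"},
}

# One candidate Elasticsearch field mapping per option discussed above.
MAPPING_OPTIONS = {
    "object": {"type": "object", "properties": CURVE_PROPERTIES},
    "nested": {"type": "nested", "properties": CURVE_PROPERTIES},
    # flattened indexes the whole object as one field, so it lists no sub-fields
    "flattened": {"type": "flattened"},
}


def lucene_document_count(records: int, objects_per_record: int, es_type: str) -> int:
    """Estimate how many underlying documents an index holds for the given type."""
    if es_type == "nested":
        # every array element is stored as its own hidden document next to its parent
        return records + records * objects_per_record
    # object and flattened keep the array inside the parent document
    return records


nested_total = lucene_document_count(100, 100, "nested")  # 100 parents + 10,000 nested
object_total = lucene_document_count(100, 100, "object")  # 100 parents only
```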
A specific object might be treated in one of these 3 ways according to a custom hint associated with that object (part of the logical schema). The main question is how the initial DD schema enrichment mechanism would work to add the hints. Generally, there are 2 options:
- Incorporate hints into the DD schema.
- Separate DD schema and hints and merge them during the Elasticsearch index creation process.
Both options have pros and cons:
- Combining the schema and hints is the easiest and most centralized way to onboard new functionality. The drawbacks are that the "clean" data model gets blended with functional-level attributes and that the DD schema itself has to change.
- Separating the data model and hints follows the separation of concerns principle, but brings other challenges, such as keeping 2 physically separate pieces in conformity and the need to store hints per provider.
The format of the hints file might be one of the following:
- Copy the original JSON structure and store hints maintaining the original DD schema hierarchy:
{ "properties" : { "data": { "allOf" : { "properties" : { "Curves" : "x-osdu-flattened" } } } } }
- Maintain a path to the desired field and its type:
{ "properties/data/allOf/properties/Curves" : "x-osdu-flattened" }
- Assuming a given type is always indexed the same way, the mapping might be done at the object level:
{ "Curves" : "x-osdu-flattened" }
In each case, the maintainer has to make sure the hints file still matches the DD file (e.g. after DD schema changes), which adds operational overhead.
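The first two hint formats carry the same information, and the conformity check mentioned above can be automated. The sketch below converts the hierarchy-style format into the path-style one and reports hints whose path no longer resolves in the schema. The function names `flatten_hints` and `stale_hint_paths` are hypothetical. Following the example above, `allOf` is treated as a plain object, whereas in real JSON Schema it is an array and would need index handling.

```python
def flatten_hints(nested: dict, prefix: str = "") -> dict:
    """Convert hierarchy-style hints into path-style hints."""
    flat = {}
    for key, value in nested.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_hints(value, path))
        else:
            # a leaf value is the hint itself, e.g. "x-osdu-flattened"
            flat[path] = value
    return flat


def stale_hint_paths(flat_hints: dict, schema: dict) -> list:
    """Return the hint paths that no longer resolve to a node in the schema."""
    stale = []
    for path in flat_hints:
        node = schema
        for part in path.split("/"):
            if not isinstance(node, dict) or part not in node:
                stale.append(path)
                break
            node = node[part]
    return stale


nested_hints = {
    "properties": {"data": {"allOf": {"properties": {"Curves": "x-osdu-flattened"}}}}
}
flat_hints = flatten_hints(nested_hints)

# Simplified stand-ins for a DD schema before and after a rename (assumed shapes).
schema_ok = {"properties": {"data": {"allOf": {"properties": {"Curves": {"type": "array"}}}}}}
schema_renamed = {"properties": {"data": {"allOf": {"properties": {"Logs": {"type": "array"}}}}}}
```

With these inputs, `stale_hint_paths(flat_hints, schema_ok)` returns an empty list, while the renamed schema reports the `Curves` path as stale.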
Scope
Implement a general approach for handling Arrays of Objects (AO) in schemas:
- Index
- Search
Decision
After analyzing the 2 options, we decided to follow the combined schema and hints approach:
- Define generic hints (enriched object schema) that tell the Indexer how a specific array of objects should be treated when feeding a schema into Elasticsearch:
"x-osdu-indexing": "x-type-nested"
"x-osdu-indexing": "x-type-flattened"
"x-osdu-indexing": "x-type-object"
- Review R3 schemas and inject hints where applicable.
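As an illustration of how the Indexer might consume these hints, the sketch below translates a hinted array property into an Elasticsearch field mapping. The function `es_mapping_for_array`, the scalar type table, and the choice to return `None` for an array without a hint (mirroring the current behaviour of ignoring such arrays) are all assumptions made for this example, not the actual Indexer implementation.

```python
# Hint values from this ADR mapped to Elasticsearch field types.
HINT_TO_ES_TYPE = {
    "x-type-nested": "nested",
    "x-type-flattened": "flattened",
    "x-type-object": "object",
}

# Assumed translation of JSON Schema scalar types to Elasticsearch types.
SCALAR_TO_ES_TYPE = {
    "string": "keyword",
    "number": "double",
    "integer": "long",
    "boolean": "boolean",
}


def es_mapping_for_array(prop_schema: dict):
    """Build a field mapping for an array-of-objects property, or None if unhinted."""
    hint = prop_schema.get("x-osdu-indexing")
    if hint is None:
        return None  # no hint: the array stays unindexed, as it is today
    if hint not in HINT_TO_ES_TYPE:
        raise ValueError(f"unknown x-osdu-indexing hint: {hint!r}")
    es_type = HINT_TO_ES_TYPE[hint]
    if es_type == "flattened":
        # flattened fields do not declare sub-field mappings
        return {"type": "flattened"}
    item_props = prop_schema.get("items", {}).get("properties", {})
    return {
        "type": es_type,
        "properties": {
            name: {"type": SCALAR_TO_ES_TYPE.get(spec.get("type"), "keyword")}
            for name, spec in item_props.items()
        },
    }


# Hypothetical Curves property carrying a hint; the sub-field names are illustrative.
curves_schema = {
    "type": "array",
    "x-osdu-indexing": "x-type-nested",
    "items": {
        "type": "object",
        "properties": {"Mnemonic": {"type": "string"}, "TopDepth": {"type": "number"}},
    },
}
curves_mapping = es_mapping_for_array(curves_schema)
```

Keeping the hint-to-type table in one place means a new hint value only requires one new entry, and an unrecognized hint fails loudly instead of silently falling back to a default.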
Rationale
Elasticsearch has no single type that suits every array of objects. The decision on the proper data type should be made as part of the Data Modeling phase. The following criteria should be taken into account in each DD case:
- Will the object's attributes be used in queries?
- What type of queries will be used?
- What is the cardinality of AO?
- Can information be moved out of AO?
Consequences
- Review R3 schemas
- Update every AO occurrence with the appropriate meta attribute
- Educate the DD community on the pros and cons of each type