ADR: Nested query search
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context
With the recent changes of incorporating search hints into the data definition schema (indexer-service#16), some arrays might be marked as nested objects (https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html). The nested type is a specialized version of the object data type that allows arrays of objects to be indexed in a way that they can be queried independently of each other.
With the implementation of indexer-service#16, nested objects are now indexed in Elasticsearch and can be queried directly (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-nested-query.html#nested-query-ex-query).
However, the Search service currently cannot recognize or interpret such requests. It must be modified so that queries can include conditions on the data of arrays indexed with the "nested" hint. It must also support complex queries that combine several conditions, some of which refer to the array and some to the main structure of the indexed document.
Scope
Extend Search service to support searching through arrays indexed with the "nested" hint:
- Extend the API query format to include search conditions for such arrays
- Implement the interpretation of such conditions when translating a request into the final Elasticsearch format
Decision
To show what the final Elasticsearch format requires for a complex search query, here is an example field mapping for an index based on a simple fictional schema:
PUT /tools
{
  "mappings": {
    "properties": {
      "tool": {"type": "text"},
      "properties": {"type": "nested"}
    }
  }
}
Here you can see one text field "tool" and one array "properties". Let's index two documents:
POST /_bulk
{"index": {"_index": "tools", "_id": "1"}}
{"tool": "hammer", "properties": [{"brand": "ABC", "country": "USA"}, {"weight": 1000}]}
{"index": {"_index": "tools", "_id": "2"}}
{"tool": "screwdriver", "properties": [{"brand": "XYZ", "country": "USSR"}, {"weight": 500}]}
The simplest query on the fields of these documents, which we could make through the Search API, would look like this:
POST {{SEARCH_HOST}}/query
{
  "kind": "osdu:toolmarket:tools:1.0.0",
  "query": "data.Tool:\"hammer\""
}
The Search service would transform it and send the following request to Elasticsearch:
GET /tools/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"tool": {"query": "hammer"}}}
      ]
    }
  }
}
But to search through an array indexed with the "nested" hint, Elasticsearch needs a special syntax. This is what the composite query for the "hammer of the US-registered ABC brand" looks like:
GET /tools/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"tool": {"query": "hammer"}}},
        {"nested": {
            "path": "properties",
            "query": {
              "bool": {
                "must": [
                  {"match": {"properties.brand": {"query": "ABC"}}},
                  {"match": {"properties.country": {"query": "USA"}}}
                ]
              }
            }
          }
        }
      ]
    }
  }
}
The query now includes a "nested" node together with a "path" hint, which tells Elasticsearch in which of the "nested" arrays the subquery should be executed.
The Search service API does not currently support this kind of condition. It should be added in a way that keeps request composition as simple as possible for the end user.
The following format is suggested:
POST {{SEARCH_HOST}}/query
{
  "kind": "osdu:toolmarket:tools:1.0.0",
  "query": "(data.Tool:\"hammer\") AND nested(data.Properties, (brand:\"ABC\" AND country: \"USA\"))"
}
As you can see, this format introduces a function nested(path, query).
The first argument specifies the path to the "nested" array to search, and the second specifies the query body, in which field names are shortened by omitting the path given in the first argument. This construction is easy to understand and easy to parse inside the Search service code.
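As a rough illustration of how the two arguments could map onto the Elasticsearch clause shown earlier, here is a minimal Python sketch. It is not the Search service implementation: the helper names `to_es_path` and `build_nested_clause` are hypothetical, and the path mapping (drop the "data." prefix and lowercase) is an assumption that only mirrors the fictional "tools" example.

```python
def to_es_path(api_path):
    """Map an API path such as 'data.Properties' to the index path 'properties'.

    Assumption: this mirrors the fictional example only (drop 'data.', lowercase).
    """
    prefix = "data."
    if api_path.startswith(prefix):
        api_path = api_path[len(prefix):]
    return api_path.lower()


def build_nested_clause(api_path, conditions):
    """Build a "nested" clause from the two arguments of nested(path, query).

    conditions is a list of (field, value) pairs whose field names are
    relative to the path, as in the proposed API format.
    """
    es_path = to_es_path(api_path)
    # Re-attach the path to every shortened field name.
    must = [
        {"match": {f"{es_path}.{field}": {"query": value}}}
        for field, value in conditions
    ]
    return {"nested": {"path": es_path, "query": {"bool": {"must": must}}}}


clause = build_nested_clause(
    "data.Properties", [("brand", "ABC"), ("country", "USA")]
)
```

Here `clause` has the same shape as the "nested" node in the composite query above, with the path re-attached to each field name.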
Now let's extend the query with a "range" condition on tool weight. Because the "weight" property is defined in a separate array item object, two "nested" subqueries are needed (the query string below is wrapped across lines for readability only):
POST {{SEARCH_HOST}}/query
{
  "kind": "osdu:toolmarket:tools:1.0.0",
  "query": "(data.Tool:\"hammer\")
    AND nested(data.Properties, (brand:\"ABC\" AND country: \"USA\"))
    AND nested(data.Properties, (weight:\">500\"))"
}
The resulting Elasticsearch query looks like this:
{
  "query": {
    "bool": {
      "must": [
        {"match": {"tool": {"query": "hammer"}}},
        {"nested": {
            "path": "properties",
            "query": {
              "bool": {
                "must": [
                  {"match": {"properties.brand": {"query": "ABC"}}},
                  {"match": {"properties.country": {"query": "USA"}}}
                ]
              }
            }
          }
        },
        {"nested": {
            "path": "properties",
            "query": {
              "bool": {
                "must": [
                  {"range": {"properties.weight": {"gt": 500}}}
                ]
              }
            }
          }
        }
      ]
    }
  }
}
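To illustrate why the nested(path, query) form is straightforward to parse, the following Python sketch extracts the top-level nested(...) calls from a query string by counting parentheses. The function name `find_nested_calls` is hypothetical, and the sketch assumes that quoted values contain no parentheses and that paths contain no commas; a real parser would need to handle those cases.

```python
def find_nested_calls(query):
    """Return (path, inner_query) pairs for each top-level nested(...) call.

    Assumptions: no parentheses inside quoted values, no commas in paths.
    """
    marker = "nested("
    calls = []
    pos = 0
    while True:
        start = query.find(marker, pos)
        if start == -1:
            break
        depth = 1  # the parenthesis opened by "nested("
        i = start + len(marker)
        while i < len(query) and depth > 0:
            if query[i] == "(":
                depth += 1
            elif query[i] == ")":
                depth -= 1
            i += 1
        if depth != 0:
            raise ValueError("unbalanced parentheses in nested() call")
        # Everything between "nested(" and its matching ")".
        args = query[start + len(marker):i - 1]
        path, _, inner = args.partition(",")
        calls.append((path.strip(), inner.strip()))
        pos = i  # continue after this call, so inner calls stay in `inner`
    return calls


query = (
    '(data.Tool:"hammer") '
    'AND nested(data.Properties, (brand:"ABC" AND country: "USA")) '
    'AND nested(data.Properties, (weight:">500"))'
)
calls = find_nested_calls(query)
```

For the query above, `calls` contains two entries, one per "nested" subquery. Because the scan resumes after each matched call, a multi-level nested(...) stays inside the inner query text and could be handled by applying the same function recursively.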
Outcome of the decision:
- format for QUERY section:
  - one level nesting:
    nested(path, query)
  - multi level nesting:
    ...nested(path1, (...nested(path12, (...nested(path123, (...)...)...)...)...)...)
  - example:
    "query": "(data.Tool:\"hammer\")
      AND nested(data.Properties, (brand:\"ABC\" AND country: \"USA\"))
      AND nested(data.Properties, (weight:\">500\"))"
- format for SORT section:
  - format:
    nested(path, field, mode)
  - example:
    "sort": {
      "field": ["nested(data.Properties, brand, min)", "nested(data.Properties, country, min)"],
      "order": ["ASC", "ASC"]
    }
- format for AGGREGATION section:
  - format:
    nested(path, field)
  - example:
    "aggregateBy": "nested(data.Properties, brand)",
Rationale
Nested object queries are a valuable type of search. Queries can be very sophisticated and include multiple AND/OR conditions addressed to different parts of the indexed document structure, including multiple references to the same or different "nested" arrays. Only arrays indexed as "nested" allow an accurate search by a set of properties of each individual array item.
The proposed API query format can describe all of these composite conditions.
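The accuracy argument can be illustrated with a small Python simulation. It does not call Elasticsearch; it only models the documented difference between a plain object array, where field values from all items are pooled together, and a "nested" array, where each item is matched on its own. The document used here is a hypothetical variant of the hammer example with two branded items.

```python
def matches_flattened(doc, conditions):
    """Model a plain object array: values of each field are pooled across items."""
    pooled = {}
    for item in doc["properties"]:
        for field, value in item.items():
            pooled.setdefault(field, set()).add(value)
    return all(
        value in pooled.get(field, set()) for field, value in conditions.items()
    )


def matches_nested(doc, conditions):
    """Model a "nested" array: all conditions must hold within a single item."""
    return any(
        all(item.get(field) == value for field, value in conditions.items())
        for item in doc["properties"]
    )


# Hypothetical document: no single item is both brand ABC and country USSR.
doc = {
    "tool": "hammer",
    "properties": [
        {"brand": "ABC", "country": "USA"},
        {"brand": "XYZ", "country": "USSR"},
    ],
}
conditions = {"brand": "ABC", "country": "USSR"}

flattened_hit = matches_flattened(doc, conditions)  # True: a false positive
nested_hit = matches_nested(doc, conditions)        # False: items kept separate
```

The flattened model reports a match because "ABC" and "USSR" both occur somewhere in the array, while the nested model correctly rejects the document.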
Consequences
- Educate the DD community on the pros and cons of using nested type vs flattened.