ADR: Bag of Words
ADR: Copy all text fields to BagOfWords field
- ADR: Copy all text fields to BagOfWords field
- Status
- Background
- Context & Scope
- Tradeoff Analysis
- Proposed solution
- Change Management
- Decision
- Consequences
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Background
The application development stakeholders want to give their users a way to search for words in a record regardless of where they appear in the record. Currently this does not work for nested fields, because the underlying mechanism relies on the ES query_string query, which does not search through nested documents.
Context & Scope
Requirements
- Users can find resources by words stored in any field, using a query without explicit field names.
- Users can find resources that reference a given ID from an external system, if that ID is part of the OSDU ID used in the reference.
- (Additional) The list of all phrases is stored in a single field, so that simple autocompletion can be implemented.
Tradeoff Analysis
Option 1
All fields are copied to the word bag using the copy_to mechanism. We are proposing bagOfWords as the internal field name for this use case. This enables the user to find wells through their alias names using a full-text query (name aliases are stored in a nested array, so currently this is not possible without explicitly specifying the field name).
Additionally, we would like to add the ID detail to bagOfWords, as these are often IDs from external source systems (in "osdu:wks::master-data--Well-1.0.0:43234324", the detail may contain the UWI). So when users know 43234324 (for example from the source system) but do not know the OSDU internal ID system, they are still able to find records referencing it (for example, find all DS related to a given wellbore).
Such a field is also valuable for implementing search-as-you-type autocompletion: we can create a simple but powerful version of it by adding a subfield with ES completion indexing and exposing it for searching.
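As a rough illustration of Option 1, the sketch below builds an Elasticsearch-style mapping in which every text or keyword field under data carries copy_to pointing at bagOfWords, and bagOfWords has a completion subfield. It is only a sketch under assumptions: the helper names, the NameAliases/AliasName fields, and the "autocomplete" subfield name are hypothetical, and the way the real indexer treats nested objects is not specified in this ADR.

```python
import copy

BAG_FIELD = "bagOfWords"  # field name proposed in this ADR


def add_copy_to(properties, target=BAG_FIELD):
    """Return a copy of a mapping 'properties' dict where text/keyword fields copy into target."""
    result = {}
    for name, spec in properties.items():
        spec = copy.deepcopy(spec)
        if "properties" in spec:
            # Object or nested field: recurse into its children.
            spec["properties"] = add_copy_to(spec["properties"], target)
        elif spec.get("type") in ("text", "keyword"):
            spec["copy_to"] = target
        result[name] = spec
    return result


def build_mapping(data_properties):
    """Wrap the data properties and add the word-bag field itself (hypothetical layout)."""
    return {
        "properties": {
            "data": {"properties": add_copy_to(data_properties)},
            BAG_FIELD: {
                "type": "text",
                # Assumed subfield name; 'completion' is the ES suggester field type.
                "fields": {"autocomplete": {"type": "completion"}},
            },
        }
    }


# Hypothetical field set; only FacilityName appears in this ADR.
example = build_mapping({
    "FacilityName": {"type": "text"},
    "NameAliases": {
        "type": "nested",
        "properties": {"AliasName": {"type": "text"}},
    },
    "VerticalMeasurement": {"type": "double"},
})
```

Non-text fields such as the numeric one above are left untouched, so only textual content ends up in the word bag.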
Option 2
If for some reason Option 1 is too broad, it is suggested to use the indexing hints added to the schema files as described here: https://gitlab.opengroup.org/osdu/subcommittees/ea/work-products/adr-elaboration/-/issues/66. A tag such as x-osdu-indexing-copytowordbag could indicate whether the associated field is to be added to the word bag field, for example:
"x-osdu-indexing-copytowordbag": "enabled"/"disabled"
However, such an approach would make schemas less portable, as every OSDU installation may have different needs.
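To make Option 2 more concrete, here is a minimal sketch of how an indexer could pick the fields to copy by reading the hint from a schema. The function name, the sample schema, and the rule that only an explicit "enabled" value selects a field are assumptions for illustration; the ADR does not define the default behavior.

```python
HINT = "x-osdu-indexing-copytowordbag"  # tag name suggested in this ADR


def fields_for_word_bag(schema, prefix=""):
    """Collect dotted paths of properties whose hint is 'enabled' (assumed semantics)."""
    selected = []
    for name, spec in schema.get("properties", {}).items():
        path = prefix + name
        if spec.get(HINT) == "enabled":
            selected.append(path)
        # Descend into objects, and into the item schema of arrays.
        child = spec.get("items", spec)
        selected.extend(fields_for_word_bag(child, path + "."))
    return selected


# Hypothetical schema fragment; property names other than FacilityName are illustrative.
schema = {
    "properties": {
        "FacilityName": {"type": "string", HINT: "enabled"},
        "NameAliases": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "AliasName": {"type": "string", HINT: "enabled"},
                    "AliasNameTypeID": {"type": "string", HINT: "disabled"},
                },
            },
        },
        "Source": {"type": "string"},
    }
}

selected = fields_for_word_bag(schema)
```

With this sample schema, FacilityName and NameAliases.AliasName are selected, while the disabled and untagged properties are skipped.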
Proposed solution
For each kind of resource, the index mapping will include a bagOfWords field whose value contains all (normalized) tokens from all other text fields in the mapping.
This will enable a query of the form:
{
  "kind": "osdu:*:*:*",
  "query": "test"
}
which would return
{
  "results": [
    {
      "data": {
        "FacilityName": "Example test"
      },
      "id": "osdu:master-data--Well:1012"
    },
    {
      "data": {
        "FacilityNameAlias": "Example test"
      },
      "id": "osdu:master-data--Well:30142"
    }
  ]
}
The search service queries against the bagOfWords field, so the two wells are returned even though 'test' occurs in a different field in each.
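The intended behavior can be simulated in memory without Elasticsearch. The sketch below gathers lower-cased word tokens from every string in a record's data, at any nesting depth, and matches a query when all of its tokens are in that bag. The tokenizer, the all-tokens-must-match rule, and the Wellbore record with its WellID reference are assumptions for illustration; real ES analyzers and query parsing behave differently in detail.

```python
import re


def tokens(value):
    """Yield lower-cased word tokens from strings nested anywhere in value."""
    if isinstance(value, str):
        yield from re.findall(r"\w+", value.lower())
    elif isinstance(value, dict):
        for item in value.values():
            yield from tokens(item)
    elif isinstance(value, list):
        for item in value:
            yield from tokens(item)


def bag_of_words(record):
    """Simplified stand-in for the bagOfWords field of one record."""
    return set(tokens(record.get("data", {})))


def search(records, query):
    """Return IDs of records whose bag contains every token of the query."""
    wanted = set(tokens(query))
    return [r["id"] for r in records if wanted <= bag_of_words(r)]


records = [
    {"id": "osdu:master-data--Well:1012", "data": {"FacilityName": "Example test"}},
    {"id": "osdu:master-data--Well:30142", "data": {"FacilityNameAlias": "Example test"}},
    # Hypothetical record that references the first well by its OSDU ID.
    {"id": "osdu:master-data--Wellbore:9", "data": {"WellID": "osdu:master-data--Well:1012:"}},
]

hits = search(records, "test")
referencing = search(records, "1012")
```

Here hits holds the two Well IDs, and referencing holds only the Wellbore ID, because the ID detail 1012 appears as a token inside its reference.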
Accepted Limitations / things to work out
Change Management
- Operators may need to run a reindex with the force_clean=true option on existing indices to enable this feature.
Decision
Consequences
- The indexer code changes should have no impact on automated applications, as they use field-specific queries, which are unchanged. Applications where the user controls the top-level query might show additional results (for matches in nested objects and in ID details), but this is expected behavior.