ADR: Bag of Words

ADR: Copy all text fields to the BagOfWords field

Status
Background
Context & Scope
- Requirements
Tradeoff Analysis
- Option 1
- Option 2
Proposed solution
- Accepted Limitations / things to work out
Change Management
Decision
Consequences

Status

Background

The application development stakeholders want to give their users a way to search for words in a record regardless of where they appear in the record. Currently this does not work for nested fields, because the underlying mechanism relies on the Elasticsearch (ES) query_string query, which does not search through nested documents.

Context & Scope


Requirements

Users can find resources by words stored in any field, using a query that does not name explicit fields.
Users can find resources that reference a given ID from an external system if that ID is part of the referencing OSDU ID.
(Additional) All phrases are stored in a single field so that simple autocompletion can be implemented.


Tradeoff Analysis

Option 1

All fields are copied to the word bag using the Elasticsearch copy_to mechanism. We propose bagOfWords as the internal field name for this use case. This lets the user find wells by their alias names with a full-text query; name aliases are stored in a nested array, so today this is not possible without explicitly specifying the field name.

In addition, we would like to add the ID detail to bagOfWords, because it is often an ID from an external source system. For example, in "osdu:wks::master-data--Well-1.0.0:43234324" the detail may contain the UWI. A user who knows 43234324 (for example from the source system) but does not know the OSDU internal ID scheme can then still find records that reference it (for example, find all DS related to a given wellbore).

Such a field is also valuable for implementing search-as-you-type autocompletion: a simple but powerful version can be created by adding a subfield with ES completion indexing and exposing it for searching.
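The two ideas in Option 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the indexer's actual code: the helper names (`add_bag_of_words`, `id_detail`), the `autocomplete` subfield name, the trimmed Well mapping, and the choice to copy both `text` and `keyword` fields are all hypothetical. It also assumes that the ID detail is the last non-empty colon-separated segment of an ID.

```python
import copy
import json

BAG_FIELD = "bagOfWords"  # internal field name proposed in Option 1


def add_bag_of_words(properties: dict) -> dict:
    """Return a copy of an Elasticsearch `properties` mapping in which every
    text/keyword field copies its value into the root-level bag field."""

    def visit(props: dict) -> None:
        for spec in props.values():
            if "properties" in spec:  # object or nested field: recurse
                visit(spec["properties"])
            elif spec.get("type") in ("text", "keyword"):
                spec["copy_to"] = [BAG_FIELD]

    result = copy.deepcopy(properties)
    visit(result)
    # The bag itself is a text field; the completion subfield reflects the
    # autocompletion idea (the subfield name "autocomplete" is an assumption).
    result[BAG_FIELD] = {
        "type": "text",
        "fields": {"autocomplete": {"type": "completion"}},
    }
    return result


def id_detail(record_id: str) -> str:
    """Return the last non-empty colon-separated segment of an ID.

    Assumption: this segment is the 'ID detail'; versioned IDs would need
    extra handling that is not shown here.
    """
    segments = [s for s in record_id.split(":") if s]
    return segments[-1] if segments else ""


# Hypothetical, heavily trimmed mapping for a Well record.
well_properties = {
    "FacilityName": {"type": "text"},
    "NameAliases": {
        "type": "nested",
        "properties": {"AliasName": {"type": "text"}},
    },
}

mapping = add_bag_of_words(well_properties)
print(json.dumps(mapping, indent=2))
print(id_detail("osdu:wks::master-data--Well-1.0.0:43234324"))
```

The generated mapping adds `copy_to` to the nested `AliasName` field as well, which is the point of the option: the alias text ends up in a root-level field that a plain full-text query can reach.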

Option 2

If for some reason Option 1 is too broad, it is suggested to use the indexing hints added to the schema files, as described here: https://gitlab.opengroup.org/osdu/subcommittees/ea/work-products/adr-elaboration/-/issues/66. A tag such as x-osdu-indexing-copytowordbag could indicate whether the associated field is to be added to the word-bag field, for example:

"x-osdu-indexing-copytowordbag": "enabled"/"disabled"

However, such an approach would make schemas less portable, as every OSDU installation may have different needs.
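As a rough sketch of how such a hint might be consumed, the Python snippet below walks a schema fragment and collects the properties whose hint is set to "enabled". The function name `fields_for_word_bag`, the example schema, and the dotted-path output format are assumptions made for illustration; only the tag name comes from the text above.

```python
HINT = "x-osdu-indexing-copytowordbag"  # tag name suggested in Option 2


def fields_for_word_bag(schema: dict, prefix: str = "") -> list:
    """Return dotted paths of properties whose hint is set to 'enabled'."""
    selected = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if spec.get(HINT) == "enabled":
            selected.append(path)
        # Descend into nested objects and into array item schemas.
        child = spec.get("items", spec)
        if isinstance(child, dict) and "properties" in child:
            selected.extend(fields_for_word_bag(child, path + "."))
    return selected


# Hypothetical schema fragment; property names are illustrative only.
schema = {
    "properties": {
        "FacilityName": {"type": "string", HINT: "enabled"},
        "FacilityID": {"type": "string", HINT: "disabled"},
        "NameAliases": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "AliasName": {"type": "string", HINT: "enabled"},
                    "AliasNameTypeID": {"type": "string"},
                },
            },
        },
    }
}

print(fields_for_word_bag(schema))
# ['FacilityName', 'NameAliases.AliasName']
```

Properties without the tag are treated as not selected here. Whether the default should instead be "enabled" is a design choice that the option leaves open.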


Proposed solution

For each kind of resource, the index mapping will include a bagOfWords field whose value contains all (normalized) tokens from all other text fields in the mapping.

This will enable a query of the form:

{
    "kind": "osdu:*:*:*",
    "query": "test"
}

which would return

{
  "results": [
    {
      "data": {
        "FacilityName": "Example test"
      },
      "id": "osdu:master-data--Well:1012"
    },
    {
      "data": {
        "FacilityNameAlias": "Example test"
      },
      "id": "osdu:master-data--Well:30142"
    }
  ]
}

The search service queries the bagOfWords field, so both wells are returned even though 'test' occurs in different fields.
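The matching behaviour can be illustrated with a small Python simulation. It is not the indexer or search service code, and it only approximates Elasticsearch analysis by lower-casing and splitting on non-word characters. The helper names and the third record, which has a nested alias array, are hypothetical.

```python
import re
from typing import Any, Iterator


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found in a (possibly nested) record payload."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def bag_of_words(record: dict) -> set:
    """Lower-cased word tokens from all string values under record['data']."""
    tokens = set()
    for text in iter_strings(record.get("data", {})):
        tokens.update(re.findall(r"\w+", text.lower()))
    return tokens


def search(records: list, query: str) -> list:
    """Return the IDs of records whose bag contains the single query term."""
    term = query.lower()
    return [r["id"] for r in records if term in bag_of_words(r)]


records = [
    {"id": "osdu:master-data--Well:1012",
     "data": {"FacilityName": "Example test"}},
    {"id": "osdu:master-data--Well:30142",
     "data": {"FacilityNameAlias": "Example test"}},
    # Hypothetical record with an alias stored in a nested array.
    {"id": "osdu:master-data--Well:99",
     "data": {"NameAliases": [{"AliasName": "Other name"}]}},
]

print(search(records, "test"))
# ['osdu:master-data--Well:1012', 'osdu:master-data--Well:30142']
print(search(records, "other"))
# ['osdu:master-data--Well:99']
```

Because the bag is built from every string in the payload, the term is found regardless of the field name or nesting depth, which mirrors what copying all text fields into one field is meant to achieve.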


Accepted Limitations / things to work out


Change Management

Operators may need to run a reindex with force_clean=true on existing indices to enable this feature.

Decision

Consequences

The indexer code changes should have no impact on automated applications, as they use field-specific queries, which are unchanged. Applications in which the user controls the top-level query might show additional results (for matches in nested objects and in ID details), but this is expected behavior.



Edited Mar 18, 2024 by Mark Chance