ADR: Search text with special characters '_' and '.'

Status

Proposed
Under review
Approved
Retired

Context & Scope

Principal Motivation: The OSDU indexer and search services use the Elasticsearch default analyzer (also called the standard analyzer) to analyze unstructured text when it is indexed and searched. Because of the way the standard analyzer tokenizes unstructured text, it is very difficult, if not impossible, to perform certain high-value searches on unstructured content. For example, suppose users want to find a file named 1-ABC_Seismic_Report.pdf. They cannot find it using one or two keywords from the file name, such as "abc", "seismic", or "report", and they cannot search on the pdf extension to find all PDF files. They cannot even use a wildcard query like *seismic*, because a leading wildcard is not supported. Instead, they have to search for the exact name or, at a minimum, use ABC_Seismic* if they want to use wildcards.

We found that the Elasticsearch standard analyzer treats some similar special characters differently, for example dash (-) versus underscore (_), and comma (,) versus dot (.). If underscore were processed like dash, and dot like comma, the search limitation above could be solved easily.

Scope: In this ADR, we propose extending the Elasticsearch standard analyzer to process two additional special characters as word delimiters:

underscore _
dot . It will be handled like the comma ,. Note that the Elasticsearch standard analyzer does not treat , as a word delimiter when it is part of a number string, e.g. 1,663m. In this proposal, . will be processed in a similar way: in -999.25 or 10.88, the . will not be treated as a word delimiter.

Approach:

Create a custom analyzer that inherits the Elasticsearch standard analyzer, which consists of:

Tokenizer
- Standard Tokenizer
Token Filters
- Lower Case Token Filter
For details of the standard analyzer definition, please refer to the Elasticsearch documentation for the Standard analyzer.
Replace the _ character with whitespace.
Replace the . character with whitespace if it is not part of a number string.
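The approach above could be expressed as Elasticsearch index analysis settings along the following lines. This is only a sketch under stated assumptions, not the settings the indexer actually ships: the names osdu_custom_analyzer, underscore_to_space and non_numeric_dot_to_space are hypothetical, and the regular expression assumes that a "." is part of a number string only when it has a digit on both sides. The settings are built as a Python dictionary so they can be serialized to JSON for an index-creation request.

```python
import json


def build_analysis_settings() -> dict:
    """Return a sketch of index analysis settings for the proposed analyzer.

    All filter and analyzer names are hypothetical. The dot pattern assumes
    that "." is numeric only when it sits between two digits.
    """
    return {
        "analysis": {
            "char_filter": {
                # Replace every underscore with a single space.
                "underscore_to_space": {
                    "type": "pattern_replace",
                    "pattern": "_",
                    "replacement": " ",
                },
                # Replace a dot unless it has a digit on both sides.
                "non_numeric_dot_to_space": {
                    "type": "pattern_replace",
                    "pattern": r"(?<![0-9])\.|\.(?![0-9])",
                    "replacement": " ",
                },
            },
            "analyzer": {
                "osdu_custom_analyzer": {
                    "type": "custom",
                    # Character filters run before tokenization.
                    "char_filter": [
                        "underscore_to_space",
                        "non_numeric_dot_to_space",
                    ],
                    # Same tokenizer and token filter as the standard analyzer.
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_analysis_settings(), indent=2))
```

Using character filters keeps the standard tokenizer and the lower case token filter untouched, which matches the intent of changing only how _ and . are handled.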

Use Case with Input Text Samples Scenario

dev_tools_console.pdf will be mapped to dev tools console pdf before tokenization and the lower case token filter are applied.
No.10 will be mapped to No 10 before tokenization and the lower case token filter are applied.
A . inside a number string such as 232.113, -999.25, or 3.14159 will not be mapped to whitespace.
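The sample mappings above can be reproduced with a small Python sketch of the character-mapping step alone. It does not emulate the Elasticsearch tokenizer or the lower case token filter. The function name map_special_chars is hypothetical, and the rule that a "." belongs to a number only when it has a digit on both sides is an assumption; the ADR does not define "number string" more precisely.

```python
import re

# Assumption: a dot is "part of a number string" only when it has a digit
# immediately before and immediately after it. Any other dot is replaced.
_NON_NUMERIC_DOT = re.compile(r"(?<!\d)\.|\.(?!\d)")


def map_special_chars(text: str) -> str:
    """Map '_' and non-numeric '.' to a space, before tokenization."""
    text = text.replace("_", " ")
    return _NON_NUMERIC_DOT.sub(" ", text)


if __name__ == "__main__":
    samples = [
        "dev_tools_console.pdf",
        "No.10",
        "232.113",
        "-999.25",
        "3.14159",
        "1-ABC_Seismic_Report.pdf",
    ]
    for sample in samples:
        print(f"{sample!r} -> {map_special_chars(sample)!r}")
```

Under this assumption, a leading or trailing dot (for example in ".5" or "10.") would also be replaced; whether that is desirable is left open by the proposal.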

Trade-off Analysis

To avoid significant changes to the current search behavior and unexpected results, the proposed custom analyzer inherits the existing default (standard) analyzer and only processes two additional special characters, _ and ., as word delimiters, in the same way the analyzer already handles the similar special characters - and ,.
To reduce the risks of re-indexing (e.g. work interruption), we will make this solution a feature that is managed at the data partition level using the partition service. That way, its usage can be rolled out in a controlled manner.

Decision

Consequences

Known Issues/Limitation/Notes

Adopting the custom analyzer will require re-indexing the kinds in the given partition.

Edited Aug 23, 2024 by Zhibin Mai