ADR: Search text with special characters '_' and '.'
Status
- Proposed
- Under review
- Approved
- Retired
Context & Scope
Principal Motivation: The OSDU indexer and search services use the Elasticsearch default analyzer (also called the standard analyzer) to analyze unstructured text when it is indexed and searched. Because of the way the standard analyzer processes unstructured text, certain high-value searches on unstructured content are very difficult, if not impossible. For example, suppose users want to find a file named 1-ABC_Seismic_Report.pdf. They cannot find it by searching for one or two keywords from the file name, such as "abc", "seismic", or "report", and they cannot find all PDF files by searching for the pdf extension. They cannot even use a wildcard such as *seismic*, because a wildcard in the prefix position is not supported. Users have to search with an exact match or, at a minimum, with ABC_Seismic* if they want to use wildcards.
We found that the Elasticsearch standard analyzer processes some similar special characters differently, for example dash '-' versus underscore '_', and comma ',' versus dot '.'. If underscore '_' were processed like dash '-', and dot '.' like comma ',', the search limitation above could be solved easily.
Scope: In this ADR, we propose extending the Elasticsearch standard analyzer to process two additional special characters as word delimiters:
- underscore '_'
- dot '.'
The dot '.' will be handled like the comma ','. Please note that the Elasticsearch standard analyzer does not treat ',' as a word delimiter if it is part of a number string, e.g. 1,663m. In this proposal, '.' will be processed in a similar way: in number strings such as -999.25 or 10.88, the '.' won't be treated as a word delimiter.
Approach:
- Create a custom analyzer that inherits the Elasticsearch standard analyzer, which consists of:
  - Standard Tokenizer
  - Lower Case Token Filter
  For details of the Elasticsearch standard analyzer definition, please refer to the Elasticsearch "Standard analyzer" documentation.
- Replace the '_' character with whitespace.
- Replace the '.' character with whitespace if it is not part of a number string.
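The approach above could be expressed as Elasticsearch index analysis settings along the following lines. This is a minimal sketch, not the actual OSDU implementation: the names osdu_text_analyzer and underscore_dot_to_space are hypothetical, and the regular expression assumes that a '.' is part of a number string only when it has a digit on both sides. The settings are built as a Python dictionary and serialized to the JSON body that would be sent when creating an index.

```python
import json

# Hypothetical names for illustration only; the ADR does not name the
# analyzer or the character filter.
ANALYZER_NAME = "osdu_text_analyzer"
CHAR_FILTER_NAME = "underscore_dot_to_space"

CUSTOM_ANALYSIS = {
    "analysis": {
        "char_filter": {
            CHAR_FILTER_NAME: {
                "type": "pattern_replace",
                # Matches every "_" and every "." that lacks a digit on at
                # least one side (assumed definition of "not in a number").
                "pattern": r"_|(?<!\d)\.|\.(?!\d)",
                "replacement": " ",
            }
        },
        "analyzer": {
            ANALYZER_NAME: {
                "type": "custom",
                # Character filter runs before tokenization.
                "char_filter": [CHAR_FILTER_NAME],
                # Same building blocks as the standard analyzer.
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}


def settings_body() -> str:
    """Return the JSON request body for index creation."""
    return json.dumps({"settings": CUSTOM_ANALYSIS}, indent=2)


if __name__ == "__main__":
    print(settings_body())
```

A pattern_replace character filter is used here because the '.' rule depends on the neighboring characters, which a plain one-to-one character mapping cannot express.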
Use Case with Input Text Samples Scenario
- dev_tools_console.pdf will be mapped to dev tools console pdf before tokenization and the lower case token filter.
- No.10 will be mapped to No 10 before tokenization and the lower case token filter.
- '.' in number strings like 232.113, -999.25, 3.14159, etc. won't be mapped to whitespace.
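The mapping in these samples can be checked locally with a small simulation of the character replacement step. This is only a sketch under one assumption that the ADR does not spell out: a '.' counts as part of a number string when it has a digit immediately before and after it. The function name map_delimiters is hypothetical, and the code does not reproduce Elasticsearch tokenization or lowercasing.

```python
import re

# Assumption: "." is kept only when it sits between two digits;
# every "_" and every other "." becomes a single space.
_DELIMITER = re.compile(r"_|(?<!\d)\.|\.(?!\d)")


def map_delimiters(text: str) -> str:
    """Apply the proposed '_' and '.' replacement before tokenization."""
    return _DELIMITER.sub(" ", text)


if __name__ == "__main__":
    samples = [
        "dev_tools_console.pdf",
        "No.10",
        "232.113",
        "-999.25",
        "3.14159",
    ]
    for sample in samples:
        print(f"{sample!r} -> {map_delimiters(sample)!r}")
```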
Trade-off Analysis
- To avoid significant changes to the current search behavior and unexpected results, our proposed custom analyzer inherits the existing default (standard) analyzer and only processes two additional special characters, '_' and '.', as word delimiters, in a way similar to the two comparable special characters '-' and ','.
- To reduce the risks of re-indexing (e.g. work interruption), we will make this solution a feature managed at the data partition level through the partition service. That way, usage can be rolled out in a controlled manner.
Decision
Consequences
Known Issues/Limitation/Notes
- Adopting the custom analyzer will require re-indexing the kinds in the given partition.