ADR: Search text with special characters '_' and '.'

Status

  • Proposed
  • Under review
  • Approved
  • Retired

Context & Scope

Principal Motivation: The OSDU indexer and search services use the Elasticsearch default analyzer (also called the standard analyzer) to analyze unstructured text when it is indexed and searched. Because of the way the standard analyzer tokenizes unstructured text, certain high-value searches on unstructured content are very difficult, if not impossible. For example, suppose users want to find a file named 1-ABC_Seismic_Report.pdf. They cannot find it with one or two keywords from the file name, such as "abc", "seismic", or "report", and they cannot search on the pdf extension to find all PDF files. They cannot even use a wildcard such as *seismic*, because a leading wildcard is not supported. They have to search for the exact file name or, at a minimum, use ABC_Seismic* if they want a wildcard.

We found that the Elasticsearch standard analyzer processes some similar special characters differently: the dash (-) versus the underscore (_), and the comma (,) versus the dot (.). If the underscore were processed like the dash, and the dot like the comma, the search limitation above would be resolved easily.

Scope: In this ADR, we propose extending the Elasticsearch standard analyzer to treat two additional special characters as word delimiters:

  • underscore _
  • dot ., which will be handled like the comma ,. Note that the Elasticsearch standard analyzer does not treat , as a word delimiter when it is part of a number string, e.g. 1,663m. In this proposal, . will be processed in a similar way: in -999.25 or 10.88, the . won't be treated as a word delimiter.

Approach:

  • Create a custom analyzer that inherits from the Elasticsearch standard analyzer, which consists of:

    Tokenizer

    • Standard Tokenizer

    Token Filters

    • Lower Case Token Filter

    For details of the Elasticsearch standard analyzer definition, please refer to the "Standard analyzer" page of the Elasticsearch documentation.

  • Replace the _ character with whitespace.

  • Replace the . character with whitespace if it is not part of a number string.
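
The approach above could be expressed as Elasticsearch index analysis settings along the following lines. This is only a minimal sketch, not the settings the OSDU indexer actually uses: it assumes the two replacements are implemented as character filters (a "mapping" filter for _ and a "pattern_replace" filter for .), and the filter and analyzer names, as well as the regular expression that decides whether a dot is "part of a number string", are illustrative assumptions.

```python
import json

# Sketch of index analysis settings for the proposed custom analyzer.
# All names below ("underscore_to_space", "non_numeric_dot_to_space",
# "standard_with_underscore_and_dot") are hypothetical.
CUSTOM_ANALYZER_SETTINGS = {
    "analysis": {
        "char_filter": {
            # Replace every underscore with a space (\u0020) before tokenization.
            "underscore_to_space": {
                "type": "mapping",
                "mappings": ["_ => \\u0020"],
            },
            # Assumption: a dot is "part of a number string" only when it has
            # a digit on both sides; any other dot is replaced with a space.
            "non_numeric_dot_to_space": {
                "type": "pattern_replace",
                "pattern": "(?<![0-9])\\.|\\.(?![0-9])",
                "replacement": " ",
            },
        },
        "analyzer": {
            # Same tokenizer and token filter as the standard analyzer,
            # with the two character filters applied first.
            "standard_with_underscore_and_dot": {
                "type": "custom",
                "char_filter": [
                    "underscore_to_space",
                    "non_numeric_dot_to_space",
                ],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }
}

# JSON body that could be sent as the "settings" of a new index.
SETTINGS_JSON = json.dumps({"settings": CUSTOM_ANALYZER_SETTINGS}, indent=2)
```

Character filters run before the tokenizer, so the standard tokenizer and the lowercase token filter stay unchanged, which is in line with the goal of keeping current search behavior as stable as possible.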

Use Cases with Sample Input Text

  1. dev_tools_console.pdf will be mapped to dev tools console pdf before tokenization and the lowercase token filter are applied.
  2. No.10 will be mapped to No 10 before tokenization and the lowercase token filter are applied.
  3. The . in number strings such as 232.113, -999.25, and 3.14159 won't be mapped to whitespace.
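
The mapping step in these use cases can be checked with a small Python sketch. It only emulates the character replacement that happens before tokenization and does not call Elasticsearch. The function name is hypothetical, and the rule that a dot belongs to a number string only when it has a digit on both sides is an assumption made for illustration.

```python
import re

# Assumption: a dot is kept only when it sits between two digits.
# Any other dot (including a leading or trailing one) becomes a space.
_DOT_NOT_IN_NUMBER = re.compile(r"(?<![0-9])\.|\.(?![0-9])")


def map_special_characters(text: str) -> str:
    """Emulate the proposed pre-tokenization mapping of '_' and '.'."""
    # Every underscore becomes a space.
    text = text.replace("_", " ")
    # Dots that are not part of a number string become spaces.
    return _DOT_NOT_IN_NUMBER.sub(" ", text)


if __name__ == "__main__":
    samples = ["dev_tools_console.pdf", "No.10", "232.113", "-999.25", "3.14159"]
    for sample in samples:
        print(f"{sample!r} -> {map_special_characters(sample)!r}")
```

In the real analyzer, the mapped text would then pass through the standard tokenizer and the lowercase token filter.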

Trade-off Analysis

  • To avoid significant changes to the current search behavior and unexpected results, the proposed custom analyzer inherits from the existing default (standard) analyzer and only treats two additional special characters, _ and ., as word delimiters, in the same way as the two similar special characters - and ,.
  • To reduce the risks of re-indexing (e.g. work interruption), we will make this solution a feature that is managed at the data partition level through the partition service. That way, its usage can be rolled out in a controlled manner.

Decision

Consequences

Known Issues/Limitations/Notes

  • Adopting the custom analyzer will require re-indexing the kinds in the given partition.
Edited by Zhibin Mai