ADR: Search text with special characters
Status
- Proposed
- Under review
- Approved
- Retired
Context & Scope
Principal motivation: Currently, because of the way Elasticsearch analyzes documents as they are indexed, it is impossible to perform certain high-value searches on unstructured content. This ADR proposes changing the default Elasticsearch index analyzer for both unstructured and structured types so that those high-value searches can be performed. The main business driver for this change comes from users of unstructured document searches, but because any analyzer change requires a heavyweight re-index, it makes sense to fold some smaller changes to structured analysis into the same operation.
Scope: The indexing analyzer will be changed to alter the way incoming strings are tokenized. This will not be an out-of-the-box analyzer, but one with small adjustments. Mainly, this will enable search queries to find exact matches on strings that are not searchable today because special characters that are part of the incoming string are tokenized away during indexing. A quick example is a North Sea well name like "8/3-1". At the search level this string is indistinguishable from "8-3/1" or "8 3 1" due to the treatment of the special characters "/" and "-". The proposal is to introduce two new analyzers: one for structured data (which is essentially all name:value pairs in JSON documents) and one for unstructured data, which is generally the output of an OCR or other text-extraction process run over large source documents.
Key Points
- Structured vs. unstructured: the analyzer applied is determined by the type of the incoming document. Unstructured documents (generally <= 5% of total data) are identified by a list of types; everything else is treated as structured.
- Implementing the new analyzer changes requires a re-index of the data. This is a heavy-duty operation and should be done as infrequently as possible.
- The structured analyzer (Approach 2) adds a word-delimiter filter so that only a defined set of special characters is retained in tokens.
- The unstructured analyzer (Approach 1) is the more generic of the two: it removes only a small set of characters and therefore will not require changes for (most) special-character cases.
- The full proposal is to implement both Approach 1 and Approach 2 and re-index just once, since re-indexing is a heavy-duty operation.
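The one-time re-index mentioned above would typically be driven with the Elasticsearch `_reindex` API once an index with the new analyzer settings exists. A minimal sketch (the index names here are hypothetical):

```json
POST _reindex
{
  "source": { "index": "osdu-records-v1" },
  "dest":   { "index": "osdu-records-v2" }
}
```

The destination index would be created first with the new analyzer settings applied, with reads switched over (e.g. via an alias) once the copy completes.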
Approach 1: Generic character-set analyzer
- Replace \n ( ) { } [ ] characters with whitespace
- Tokenize on whitespace
- Apply the following token filters:
  - Lowercase
  - English stemmer
Approach 2: Defined character-set analyzer
- Replace \n ( ) { } [ ] characters with whitespace
- Tokenize on whitespace
- Apply the following token filters:
  - Lowercase
  - English stemmer
  - Word delimiter: removes all special characters from tokens except / - : .
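The two pipelines can be sketched in Python. This is a rough simulation of the tokenization steps described above, not the actual Elasticsearch filters (the English stemmer is omitted), but it illustrates why well names like "8/3-1" become exact-searchable:

```python
import re

def approach1(text):
    """Approach 1 (generic): replace newlines and brackets with spaces,
    tokenize on whitespace, lowercase. (English stemming omitted here.)"""
    text = re.sub(r"[\n(){}\[\]]", " ", text)
    return [t.lower() for t in text.split()]

def approach2(text):
    """Approach 2 (defined): Approach 1 plus a word-delimiter step that
    strips special characters from tokens except / - : ."""
    return [re.sub(r"[^a-z0-9/\-:.]", "", t) for t in approach1(text)]

def standard_like(text):
    """Rough stand-in for the current behavior: split on any
    non-alphanumeric character, so / and - are discarded."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

# The current analyzer cannot tell these well names apart...
standard_like("well (8/3-1)")   # ['well', '8', '3', '1']
standard_like("well 8-3/1")     # ['well', '8', '3', '1']
# ...while both proposed analyzers keep the exact string as one token.
approach1("well (8/3-1)")       # ['well', '8/3-1']
approach2("well (8/3-1)")       # ['well', '8/3-1']
```

Note that under both new approaches "8/3-1" and "8-3/1" produce distinct tokens, which is exactly the distinction the current analyzer loses.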
Use Case with Input Text Samples
Scenario
The assumption is that the unstructured data below is ingested and indexed using the current analyzer, the Approach 1 analyzer, and the Approach 2 analyzer. For clarification, the data in the numbered points would land in the "data" block of the JSON metadata record (e.g. "data":"indexedPage_1").
- As the story has been told many times, exploration efforts in the North Sea were rapidly declining by fall 1969. A total of 32 wells had been drilled in the Norwegian sector since Esso completed the first exploration well (8/3-1) as dry in 1966. It didn't help that Mario had an obsession with the sequences 2-4-1 and 2-8-3 which caused some very strange OCD behavior at a number of review meetings when the exploration wells were discussed. Being an explorationist was much harder than jumping over Goombas.
- Murphy's first assignment for his new employer, well 2/4-1 (the original name is 2/4-1X), nearly ended up as a terrible disaster. The well was spudded on August 21 with Ocean Viking and Max was there from the very first turn of the drill. He was prepared to describe cuttings through a 3,000 m thick section of boring clay and shale before entering the reservoir. The Quaternary and Tertiary clay is soft and the drilling is fast. After only one week casing had been set at 146m (30’’) and 623m (20’’). Drilling had resumed and at 1,663m, on Sunday morning August 31, the formation pressure increased tremendously and oil flowed into the wellbore and the mud tanks.
- dev_tools_console.pdf
- c:\workspace\petrel\gullfaks
- /petrel/workspace/2017/05/16/gullfaks.pet
- measurements are 232.113, -999.25 (old LAS sentinel), 3.14159, etc.
- Som historien har blitt fortalt mange ganger, gikk leteinnsatsen i Nordsjøen raskt ned høsten 1969. Totalt var det boret 32 brønner i norsk sektor siden Esso fullførte den første letebrønnen (8/3-1) som tørr. i 1966. Det hjalp ikke at Mario hadde en besettelse av sekvensene 2-4-1 og 2-8-3 som forårsaket noe veldig merkelig OCD-oppførsel på en rekke gjennomgangsmøter da letebrønnene ble diskutert. Å være en utforsker var mye vanskeligere enn å hoppe over Goombas.
| End User Entry | Standard (Current) | Analyzer I | Analyzer II | Comments |
|---|---|---|---|---|
| 2/4-1X | ✗ | ✓ | ✓ | |
| 2-4-1 | ✗ | ✓ | ✓ | |
| -999.25 | ✗ | ✓ | ✓ | |
| explore | ✗ | ✓ | ✓ | fuzzy match includes explore and exploration |
| murphy | ✗ | ✓ | ✓ | possessive at the end, searchable by one token; searching with the possessive murphy's works for all |
| o'neil | ✗ | ✓ | ✓ | possessive at the end, searchable by one token; searching with the possessive o'neil's works for all |
| første letebrønnen | ✓ | ✓ | ✓ | searching non-English phrases (e.g. Norwegian in text fragment 7) |
Trade-off Analysis
- Do nothing.
  - Pros: no re-indexing required, no chance of breaking changes.
  - Cons: does not solve the business problem of exact character-string searches in unstructured and structured data.
- Approach 1 only: alter the indexing analyzer for unstructured types only.
  - Pros: exact-match searches for common oilfield terms (e.g. well names) work.
  - Cons: heavy-duty re-indexing is required to recreate index information for the affected types; leaves structured data searches with the same special-character problem.
- Approach 1 and Approach 2: alter the indexing analyzer for both unstructured and structured types.
  - Pros: solves the search problem for both structured and unstructured data; one-time re-indexing operation.
  - Cons: heavy-duty re-indexing.
Decision
If there is agreement among consumers of OSDU data records that it should be possible to enter searches like data.indexedDoc:"2/4-1" and find the North Sea well with that name, then something has to be done to the analyzer.
Given the large investment in re-indexing cycles that any change to the indexing analyzer requires, the right approach is to implement changes for both the structured and unstructured types (Approach 1 and Approach 2). This ensures consistent exact-match search results when, for example, data.wellname and data.unstructuredText contain the same well name with special characters.
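For reference, the search described above maps onto an Elasticsearch query_string query. A sketch, assuming a hypothetical records index and the data.indexedDoc field from the example:

```json
GET records/_search
{
  "query": {
    "query_string": {
      "query": "data.indexedDoc:\"2/4-1\""
    }
  }
}
```

With the current analyzer this phrase matches any document containing the tokens 2, 4, and 1 in sequence; with the proposed analyzers it matches only the exact token 2/4-1.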
Consequences
Known Issues/Limitation/Notes
- Any change to character-set analyzer will require re-indexing across all partitions and environments.
- We get support for non-English characters out of the box, with limitations. Text is analyzed with whitespace tokenization and an English stemmer; if whitespace is not used as a word boundary (e.g. Japanese), then inaccurate or empty results will be returned.