Fix for Too many results returned after bagofwords feature - change bagOfWords.autocomplete's analyzer from simple to standard

Tries to fix the issue reported here. That issue was closed by changing the integration tests to search on specific fields (instead of the previous top-level search).

Additional use cases -

  1. Usecase#1 - A client has raised a bug about the following change in search behavior. A platform instance has 2 wellbore records with data.facilityname values "AVS-72" and "AVS-71". The instance has the bagOfWords feature enabled in the indexer and the autocomplete feature enabled in search. On this instance, a top-level query for "AVS-72" returns both records. On another OSDU instance where these features are not enabled, a top-level query for "AVS-72" returns only a single document, which is the expected behavior.
  2. Usecase#2 - The autocomplete suggester does not work for purely numeric values in text fields. For example, I indexed 2 documents, one with data.FacilityName: "1234" and another with data.FacilityName: "1235". When I request suggestions for "123", no suggestion is returned. The expectation was to get both suggestions, "1234" and "1235".

Cause -

  1. As documented here, the completion field uses the simple analyzer by default.

    The simple analyzer breaks text into tokens at any non-letter character, such as numbers, spaces, hyphens and apostrophes, discards non-letter characters, and changes uppercase to lowercase.

  2. For Usecase#1 above, bagOfWords.autocomplete uses the simple analyzer during both indexing and search. Both records therefore have the value "avs" for this field. When the top-level search query "AVS-72" is received, the simple analyzer produces the term "avs", which matches both records, so both are returned. The response of the _search API with the explain=true query parameter in the attached before-fiximpl-evidence-usecase1-and-usecase2.txt confirms this.

  3. For Usecase#2 above, bagOfWords.autocomplete does not capture anything because all digits are dropped. Thus no suggestions are returned for "123".
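The tokenization difference behind both use cases can be illustrated with a small Python sketch. It does not call Elasticsearch; the two functions (`simple_analyze` and `standard_analyze`, both hypothetical names) only approximate the analyzers with regular expressions. In particular, the real standard analyzer follows Unicode text segmentation rules, which this sketch reduces to "runs of letters and digits", an assumption that is good enough for the values discussed here.

```python
import re


def simple_analyze(text: str) -> list[str]:
    """Approximate the `simple` analyzer: lowercase, keep only runs of letters."""
    # [^\W\d_] matches word characters that are neither digits nor underscore.
    return re.findall(r"[^\W\d_]+", text.lower())


def standard_analyze(text: str) -> list[str]:
    """Rough stand-in for the `standard` analyzer: lowercase, keep letters and digits."""
    # Assumption: splitting on non-word characters is close enough for these inputs.
    return re.findall(r"\w+", text.lower())


for value in ["AVS-71", "AVS-72", "1234", "1235"]:
    print(f"{value!r}: simple={simple_analyze(value)} standard={standard_analyze(value)}")
```

Under this approximation, "AVS-71" and "AVS-72" both reduce to the single token "avs" with the simple analyzer, and "1234" produces no tokens at all, while the standard-like tokenizer keeps the digits in every case.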

Suggested Fix -

  1. Change the analyzer for the autocomplete field in bagofwords from the current simple to standard. This makes the analyzer consistent across data text fields and bagofwords.
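One way to check how each analyzer treats a value on a real cluster is the Elasticsearch `_analyze` API. The sketch below only builds the requests. The base URL `http://localhost:9200` is taken from the validation setup, the helper names `build_analyze_request` and `fetch_tokens` are hypothetical, and authentication is left out as an assumption (add an `Authorization` header if the cluster requires one).

```python
import json
import urllib.request


def build_analyze_request(base_url: str, analyzer: str, text: str) -> urllib.request.Request:
    """Build a POST request for the _analyze API with the given built-in analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request: urllib.request.Request) -> list[str]:
    """Send the request and return the token strings from the response.

    Needs a reachable cluster, so it is not called in this sketch.
    """
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


# One request per analyzer for the value from Usecase#1.
analyze_requests = {
    name: build_analyze_request("http://localhost:9200", name, "AVS-72")
    for name in ("simple", "standard")
}
```

Calling `fetch_tokens(analyze_requests["simple"])` and `fetch_tokens(analyze_requests["standard"])` against a running cluster should show which terms each analyzer produces for "AVS-72".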

Validation -

  1. To simulate the issue, I created an index "adeole-index-110525-174600" with a mapping similar to what the indexer creates today when the bagofwords feature is on. I then created another index, adeole-index-110525-174600**-2**, with the fix. The only change is the analyzer for the autocomplete subfield in bagofwords:

                "bagOfWords": {
                    "type": "text",
                    "store": true,
                    "fields": {
                        "autocomplete": {
                            "type": "completion",
                            "analyzer": "standard"
                        }
                    }
                }
  2. Using the bulk API, I indexed 2 documents into both indices, one with data.FacilityName AVS-71 and another with AVS-72. I could reproduce Usecase#1 with the first index. With the latter index (ending with -2), the top-level search for "AVS-72" correctly returned just 1 hit, as expected.

  3. For Usecase#2, I indexed another 2 documents into both indices with data.FacilityName "1234" and "1235". The suggestion query below returned nothing for the first index and correctly returned both suggestions for the latter index (ending with -2).

    curl --location --request POST 'http://localhost:9200/adeole-index-110525-174600/_search' \
    --header 'Authorization: <>' \
    --header 'Content-Type: application/json' \
    --data-raw '{
        "suggest": {
            "my-suggest-1": {
                "prefix": "123",
                "completion": {
                    "field": "bagOfWords.autocomplete"
                }
            }
        }
    }'

    after-fiximpl-evidence-usecase1-and-usecase2.txt

    before-fiximpl-evidence-usecase1-and-usecase2.txt

Edited May 12, 2025 by Ashish Deole