- I have added an explanation of what changes in this merge do and why we should include it?- Yes
- Does the MR contain pipeline/ helm chart related changes? - NA
- I have updated the documentation accordingly. -Yes
- I have added tests to cover my changes. - Yes
- All new and existing tests passed. -Yes
- My code follows the code style of this project. -Yes
- I ran lint checks locally prior to submission. -Yes
What is the issue or story related to the change?
This MR tries to solve this issue. This intermittent problem of records not being found is manifested by CosmosDb due to replication lag. The indexer service is affected by this issue whenever it(indexer service) makes
http://storage/api/storage/v2/query/records:batch call. Hence, retry with maximum number of attempts as 3 and wait duration in between as 1 second is enabled only for the required call.
High level design:
This document about CosmosDb mentions
The read latency for all consistency levels is always guaranteed to be less than 10 milliseconds at the 99th percentile. The average read latency, at the 50th percentile, is typically 4 milliseconds or less." Therefore, max attempt of 3 and wait duration of 1 sec is thought to be sufficient.
Change details: Our interest area is when indexer service makes
http://storage/api/storage/v2/query/records:batch call. This call is handled by UrlFetchService( for more details see here.). This MR adds 2 classes and their corresponding Tests classes: -
- RetryPolicy:- Handles the retry logic. The one point to pay attention is that in the above issue when records are not found even then the response status code is 200 and unfound records are listed under "notFound" jsonArray of the response body. Hence, the retry logic is handled accordingly. The retry functionality is enabled by using decorators provided by resilience4j over UrlFetchService(the one residing is core-common not over UrlFetchServiceAzureImpl which is newly added)
- UrlFetchServiceAzureImpl:- This class implements IUrlFetchService, the same interface implemented by UrlFetchService. This class simply calls UrlFetchService's methods of all the calls other than
http://storage/api/storage/v2/query/records:batch. For this call, it calls the retryFunction of RetryPolicy. This is made primary bean so that this class gets injected instead of UrlFetchService.
The above methodology was used because it made zero changes in any of the existing code.
Does this introduce a breaking change?
This code has also been tested using FireFly framework. Using perf tool, we have confirmation on true-negative i.e. not re-trying when it is not supposed to but not of true-positive as this issue is not easily reproduced. The issue was not reproduced therefore, the retry functionality was tested locally by making the storage service end-point to not find records randomly. Following things were confirmed from local testing:
- Retry was not enabled for any other end-point.
- Retry was enabled when "notFound" was not empty.
- Retry was not enabled for when "notFound" was empty.
- Retry happened only once or twice in case "notFound" was empty in second or third time. The retry functionality is thoroughly unit-tests covered.