No, closing it. Thanks @chad
Adding retry, timeout and Circuit breaker to services to improve resiliency.
This is to implement resiliency across all the OSDU services. We want to make the change at the core-common level, in the HttpClientHandler implementation.
For Azure, we have already implemented resiliency features such as retries, timeout, and circuit breaker in some of our services (Legal, Entitlement, Search) by adding a new custom HTTP client handler (HttpClientHandlerAzure) in core-lib-azure, as here.
We want to add this resiliency feature to all our OSDU services, which will require either a change per service in core-lib-azure or a common change in core-common.
If we don't introduce the change in core-common, we will end up creating individual factory classes in core-lib-azure for each service, each with custom code to implement retries, timeout, and circuit breaker.
| Analysis | Core-Lib-Azure | Core-Common |
|---|---|---|
| Pros | 1. Turnaround time for code check-in is faster | 1. Resiliency is implemented for every service and every CSP by default when the feature flag is ON 2. Faster development when adding resiliency to any new service 3. Consistency across CSPs in enabling resiliency |
| Cons | 1. Resiliency must be implemented in each service's factory class 2. Resiliency code for any new service has to be added explicitly | 1. If the resiliency flag is ON, it must be disabled explicitly per service |
In distributed systems, transient failures or latency in remote interactions are inevitable. Timeouts keep systems from hanging unreasonably long, retries can mask those failures, and backoff and jitter can improve utilization and reduce congestion on systems.
We want our services to be resilient enough to anticipate unexpected events and account for them. During downtime, we also want to give pods enough time to recover from an incident by not bombarding them with further requests.
Based on the above trade-off analysis, we are proposing the change in core-common. Maintaining multiple services with similar functionality and responsibilities is additional overhead w.r.t. maintenance.
We will implement resiliency by introducing a new HTTP client, consumed under a feature flag. The default HttpClient will stay the same, and CSPs can opt into the resilient HttpClient. Here's a code snippet that we are using to validate resiliency; going forward, all of it will sit behind the feature flag.
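The referenced snippet is not reproduced here, but as a minimal illustration of the retry part of the idea, here is a hand-rolled sketch of retry with exponential backoff and full jitter (class and method names are hypothetical and not the actual core-common implementation):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Hypothetical sketch: retry a call with exponential backoff and full jitter.
public final class ResilientCall {

    public static <T> T withRetry(Supplier<T> call, int maxAttempts, Duration baseDelay) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) break;
                // Exponential backoff: baseDelay * 2^(attempt-1), then full jitter in [0, backoff].
                long backoffMs = baseDelay.toMillis() * (1L << (attempt - 1));
                long jitterMs = ThreadLocalRandom.current().nextLong(backoffMs + 1);
                try {
                    Thread.sleep(jitterMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException(ie);
                }
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated transient failure: fails twice, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient failure");
            return "ok";
        }, 5, Duration.ofMillis(10));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

The jitter spreads retries from many clients over time, which is what keeps synchronized retry storms from congesting a recovering service.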
Currently the re-index API sends a 200 response code whether or not any reindexing activity is performed. Now we send a 500 status code if any kind fails to reindex, and 200 only if the messages are successfully put on the service bus. Reindex-by-kind will show the same behaviour.
Advancement: the response for each kind is saved in a hashmap, which can be returned to the client for better tracking of which kinds were not reindexed.
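A minimal sketch of that idea (hypothetical class, not the actual Indexer code): each kind's outcome goes into a map, and the aggregate status is 500 as soon as any kind failed, otherwise 200:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: track per-kind reindex outcomes and derive one HTTP status.
public final class ReindexResultTracker {
    private final Map<String, Integer> statusByKind = new LinkedHashMap<>();

    public void record(String kind, int status) {
        statusByKind.put(kind, status);
    }

    // 200 only if every kind was queued successfully; 500 if any kind failed.
    public int aggregateStatus() {
        boolean anyFailed = statusByKind.values().stream().anyMatch(s -> s >= 400);
        return anyFailed ? 500 : 200;
    }

    // The per-kind map itself can be returned as the response body for tracking.
    public Map<String, Integer> responseBody() {
        return statusByKind;
    }
}
```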
SHEFFALI JAIN (0f40078a) at 22 Jun 05:30
reindex status code fix
Please provide link to gitlab issue or ADR(Architecture Decision Record)
We are trying to provide a design via which rate limiting can be applied to any service when enabled via a flag; it will be disabled by default. It works by setting a limit on how many requests a consumer is allowed to make in a given unit of time. We reject any requests above the limit with an appropriate response, like HTTP status 429 (Too Many Requests).
Currently, no rate limiting is applied to the service, so there is nothing limiting the number of users accessing it.
The service will have a specific token count that sets a limit restricting the number of requests users can make per cycle.
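As an illustration only (hypothetical names, not the Envoy-based implementation described below), the token-count-per-cycle idea can be sketched as a fixed window that refills at the start of each cycle:

```java
// Hypothetical sketch: a fixed-window token count, refilled each cycle.
public final class TokenLimiter {
    private final int tokensPerCycle;
    private final long cycleMillis;
    private long windowStart;
    private int tokensLeft;

    public TokenLimiter(int tokensPerCycle, long cycleMillis, long nowMillis) {
        this.tokensPerCycle = tokensPerCycle;
        this.cycleMillis = cycleMillis;
        this.windowStart = nowMillis;
        this.tokensLeft = tokensPerCycle;
    }

    // Returns true if the request is allowed; false means it should get HTTP 429.
    public synchronized boolean tryAcquire(long nowMillis) {
        if (nowMillis - windowStart >= cycleMillis) { // new cycle: refill tokens
            windowStart = nowMillis;
            tokensLeft = tokensPerCycle;
        }
        if (tokensLeft == 0) {
            return false;
        }
        tokensLeft--;
        return true;
    }
}
```

In the actual design this bookkeeping is done by the Envoy filter rather than application code; the sketch only shows the limiting semantics.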
No.
Added an Envoy filter to apply rate limiting, and added support to generate the YAML file via Helm as part of the deployment itself. The rate-limit filter is disabled by default; it can be enabled when installing via the Helm command using the following flag: --set envoyFilter.enabled=true
SHEFFALI JAIN (883e1fda) at 17 May 06:39
SHEFFALI JAIN (5451c0a4) at 17 May 06:39
Merge branch 'indexerREsiliency' into 'Azure/OSDU-Helm-Charts-Azure...
... and 1 more commit
Workflow used:
Timeout:
Referring to the screenshot below: at ~30 rps, the maximum time taken to respond was approximately 1 minute. Based on discussion with the SO and their previous experience, the timeout has been finalised as 180 secs.
Circuit Breaker:
Ejection timeout (time taken by the Register pod to restart itself after a failure, ~2 mins):
Rate limit:
Set to the number of requests that does not lead to 503s.
p.s. Check infra-related changes and prior discussions here: !247. For retries, here: !245
SHEFFALI JAIN (44a4236a) at 11 May 10:16
adding resiliency in Register services
... and 193 more commits
SHEFFALI JAIN (17e2269e) at 11 May 05:39
SHEFFALI JAIN (96355056) at 06 May 09:06
SHEFFALI JAIN (f71a4372) at 06 May 09:06
Merge branch 'indexertimeoutfix' into 'Azure/OSDU-Helm-Charts-Azure...
... and 1 more commit
Adding timeouts to the rest of the Indexer APIs will be taken up in the next sprint.