ADR: Implementing retries, circuit breaker (CB) and timeouts by default in services | RESILIENCY
Decision Title
Add retries, timeouts and a circuit breaker to services to improve resiliency.
Status
- Proposed
- Trialing
- Under review
- Approved
- Retired
Context & Scope
This ADR covers implementing resiliency in all OSDU services. We want to implement the changes at the core-common level, in the HttpClientHandler implementation.
For Azure, we have already implemented resiliency features such as retries, timeouts and a circuit breaker in some of our services (Legal, Entitlement, Search) by adding a new custom HTTP client handler (HttpClientHandlerAzure) in CORE_LIB_AZURE, as here.
We want to add this resiliency feature to all our OSDU services, which will require either a change for each service in Core-lib-Azure or a common change in core-common.
Trade-off Analysis
If we don't introduce the changes in core-common, we will end up creating individual factory classes in Core-lib-Azure for each service, each with custom changes to implement retries, timeouts and CB.
Analysis | Core-Lib-Azure | Core-Commons |
---|---|---|
Pros | 1. Faster turnaround time for code check-in | 1. Resiliency is implemented for each service for every CSP by default when the feature flag is ON 2. Faster development when adding resiliency to any new service 3. Consistency across CSPs in enabling resiliency |
Cons | 1. Resiliency must be implemented in each service's factory class 2. Resiliency code must be added explicitly for any new service | 1. If the resiliency flag is ON, it must be disabled explicitly for a service |
Advantages
In distributed systems, transient failures or latency in remote interactions are inevitable. Timeouts keep systems from hanging unreasonably long, retries can mask those failures, and backoff and jitter can improve utilization and reduce congestion on systems.
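As an illustration of retries with backoff and jitter, here is a minimal sketch in plain Java. It is not the core-common implementation: the class `RetryWithBackoff`, its method names and its parameters are hypothetical, and it assumes a capped exponential backoff with "full jitter" (a random delay between zero and the current ceiling). It also retries on every exception, whereas a real handler would likely retry only on transient failures.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper (not part of core-common): retry with capped exponential backoff and full jitter. */
final class RetryWithBackoff {

    private RetryWithBackoff() {}

    /** Largest delay allowed before retry number {@code attempt} (1-based): base * 2^(attempt-1), capped. */
    public static long backoffCeilingMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = baseMillis;
        for (int i = 1; i < attempt && ceiling < capMillis; i++) {
            ceiling *= 2; // double the ceiling for each attempt already made
        }
        return Math.min(ceiling, capMillis);
    }

    /** Full jitter: a uniformly random delay in [0, ceiling], so clients do not retry in lockstep. */
    public static long jitteredDelayMillis(int attempt, long baseMillis, long capMillis) {
        long ceiling = backoffCeilingMillis(attempt, baseMillis, capMillis);
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    /** Runs {@code action} up to {@code maxAttempts} times, sleeping a jittered delay between attempts. */
    public static <T> T call(Callable<T> action, int maxAttempts, long baseMillis, long capMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e; // assumption: every exception is treated as retryable in this sketch
                if (attempt < maxAttempts) {
                    Thread.sleep(jitteredDelayMillis(attempt, baseMillis, capMillis));
                }
            }
        }
        throw last; // all attempts failed: surface the last failure to the caller
    }
}
```

The cap bounds the worst-case wait per retry, and the jitter spreads retries from many callers over time instead of letting them hit a recovering service at the same instant.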
Rationale
The goal is to make our services resilient enough to anticipate unexpected events and account for them. During downtime, we also want to give pods enough time to recover from an incident by not bombarding them with further requests.
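The idea of giving a failing service time to recover is what a circuit breaker provides. The following is a minimal sketch under stated assumptions, not the handler used in CORE_LIB_AZURE or proposed for core-common: `SimpleCircuitBreaker`, its threshold and its open interval are hypothetical, it counts consecutive failures only, and production code would more likely rely on an established resilience library.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical minimal circuit breaker: opens after N consecutive failures, then rejects calls for a while. */
final class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final long openMillis;       // how long calls are rejected once it trips
    private final LongSupplier clock;    // injected time source (milliseconds), e.g. System::currentTimeMillis
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; an OPEN breaker moves to HALF_OPEN once the open interval has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the action unless the breaker is open, in which case the downstream service is not called. */
    <T> T call(Supplier<T> action) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit open: call rejected");
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED; // a successful trial call in HALF_OPEN closes the breaker again
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call in HALF_OPEN, or too many failures in a row, (re)opens the breaker.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

While the breaker is open, callers fail fast and the struggling pods receive no traffic from this client. This sketch lets any number of concurrent trial calls through in the half-open state; a real implementation would usually limit them.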
Proposal
Based on the above trade-off analysis, we are proposing changes in core-common. Having multiple services with similar functionality and responsibilities adds maintenance overhead.
We propose to introduce resiliency through a new HTTP client that is consumed under a feature flag. The default HttpClient will stay the same, and CSPs can opt in to the resilient HttpClient. Here's a code snippet that we are using to validate resiliency; moving forward, all of it will be under the feature flag.