ADR: Implementing retries, circuit breaker (CB) and timeouts by default in services | RESILIENCY

Decision Title

Adding retries, timeouts and a circuit breaker to services to improve resiliency.

Status

Context & Scope

This ADR proposes implementing resiliency in all OSDU services. We want to make the changes at the core-common level, in the HttpClientHandler implementation.

For Azure, we have already implemented resiliency features such as retries, timeouts and a circuit breaker in some of our services (Legal, Entitlement, Search) by adding a new custom HTTP client handler (HttpClientHandlerAzure) in CORE_LIB_AZURE, as here.

We want to add this resiliency feature to all our OSDU services, which will require either a change for each service in Core-lib-Azure or a common change in core-common.

Trade-off Analysis

If we don't introduce the changes in core-common, we will end up creating individual factory classes in Core-lib-Azure for each service, each with custom changes to implement retries, timeouts and CB.

Analysis	Core-Lib-Azure	Core-Common
Pros	1. Faster turnaround time for code check-in	1. Resiliency is implemented for every service and every CSP by default when the feature flag is ON
		2. Faster development when adding resiliency to any new service
		3. Consistency across CSPs in enabling resiliency
Cons	1. Resiliency has to be implemented in each service's factory class	1. If the resiliency flag is ON, it has to be disabled explicitly for a service that does not want it
	2. Resiliency code has to be added explicitly for any new service

Advantages

In distributed systems, transient failures or latency in remote interactions are inevitable. Timeouts keep systems from hanging unreasonably long, retries can mask those failures, and backoff and jitter can improve utilization and reduce congestion on systems.
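As an illustration of retries with backoff and jitter, here is a minimal sketch in Java. It is not the code used in core-common or Core-lib-Azure; the class name RetryWithBackoff, its method names and the parameter values are all hypothetical. It uses exponential backoff with "full jitter" (a random delay between 0 and the current cap), which is one common variant among several.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, for illustration only.
final class RetryWithBackoff {
    private RetryWithBackoff() {}

    /** Upper bound of the delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    static long backoffCapMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        // Double step by step and stop once the cap is reached, so the value cannot overflow.
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Full jitter: a uniformly random delay in [0, cap], so clients do not retry in lockstep. */
    static long jitteredDelayMillis(int attempt, long baseMillis, long maxMillis) {
        long cap = backoffCapMillis(attempt, baseMillis, maxMillis);
        return ThreadLocalRandom.current().nextLong(cap + 1);
    }

    /**
     * Runs the call up to maxAttempts times, sleeping a jittered backoff between attempts.
     * This sketch retries on every exception; real code should retry only transient failures.
     */
    static <T> T callWithRetry(Callable<T> call, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(jitteredDelayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last; // every attempt failed: surface the last failure to the caller
    }
}
```

A caller would wrap a remote call, for example callWithRetry(() -> fetchLegalTag(id), 3, 100, 2000), where fetchLegalTag is a placeholder for any remote interaction. The per-request timeout would be set on the underlying HTTP client, so that each attempt is bounded as well.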

Rationale

We want our services to be resilient enough to anticipate unexpected events and account for them. During downtime, we also want to give pods enough time to recover from an incident by not bombarding them with additional requests.
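The idea of giving a struggling pod room to recover is what a circuit breaker provides. The following is a minimal sketch of one in Java, under assumed parameters; SimpleCircuitBreaker, its threshold and its open duration are hypothetical and are not taken from the OSDU code base. After a number of consecutive failures it rejects calls outright for a cool-down period, then lets a trial call through.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical circuit breaker, for illustration only.
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long calls are rejected before a trial is allowed
    private final LongSupplier clock;   // injected clock (e.g. System::currentTimeMillis)
    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    /** Current state; an OPEN circuit becomes HALF_OPEN once the cool-down has elapsed. */
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= openMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    /** Runs the action unless the circuit is open, in which case the call is rejected immediately. */
    <T> T call(Supplier<T> action) {
        if (state() == State.OPEN) {
            throw new IllegalStateException("circuit open: call rejected");
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            throw e;
        }
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        // A failed trial call in HALF_OPEN reopens the circuit straight away.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }
}
```

This sketch allows more than one trial call while half-open and counts every runtime exception as a failure; a production implementation would typically limit trial calls and classify failures more carefully.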

Proposal

Based on the above trade-off analysis, we propose making the changes in core-common. Having multiple services each carry similar functionality and responsibilities adds maintenance overhead.

We propose to add resiliency by introducing a new HTTP client that is consumed behind a feature flag. The default HttpClient stays the same, and CSPs can opt in to the resilient HttpClient. Here's a code snippet that we are using to validate resiliency; all of it will be behind the feature flag moving forward.
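Separately from the validation snippet mentioned above, the flag-based selection itself can be pictured with a small Java sketch. Everything in it is hypothetical: SketchHttpClient, ResilientSketchClient, SketchHttpClientFactory and the property name resiliency.enabled are invented for illustration and are not the core-common API. It assumes the flag is a simple boolean and that the resilient client wraps the default one.

```java
// Hypothetical stand-in for an HTTP client abstraction; not the core-common interface.
interface SketchHttpClient {
    String send(String url);
}

// Hypothetical resilient wrapper: retries the delegate a fixed number of times.
final class ResilientSketchClient implements SketchHttpClient {
    private final SketchHttpClient delegate;
    private final int maxAttempts;

    ResilientSketchClient(SketchHttpClient delegate, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.delegate = delegate;
        this.maxAttempts = maxAttempts;
    }

    @Override
    public String send(String url) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return delegate.send(url);
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts failed
    }
}

// Hypothetical factory: the flag decides which client a service receives.
final class SketchHttpClientFactory {
    private SketchHttpClientFactory() {}

    static SketchHttpClient create(boolean resiliencyEnabled, SketchHttpClient base, int maxAttempts) {
        // Flag OFF: the default client is returned unchanged, so existing behaviour is preserved.
        return resiliencyEnabled ? new ResilientSketchClient(base, maxAttempts) : base;
    }

    /** Reads the flag from an assumed system property; the property name is made up for this sketch. */
    static boolean flagFromSystemProperty() {
        return Boolean.parseBoolean(System.getProperty("resiliency.enabled", "false"));
    }
}
```

With this shape, a service that does not opt in keeps exactly the client it has today, and a CSP that turns the flag on gets the wrapped client without changing call sites.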

Edited Jan 29, 2022 by SHEFFALI JAIN