Implement Redis cache and add metrics

The ListGroup API is one of the most frequently called APIs in Entitlements. This change adds a short-lived Redis cache to reduce the number of calls made to Cosmos DB, which lowers RU consumption and improves latency for this API. Starting next season we plan to implement a long-lived Redis cache that flushes and recreates cache entries on write operations. For now, the short-lived cache is sufficient to meet our load test goal of 200 users calling the ListGroup API simultaneously at 110 RPS.

In the Redis cache, the key is userEmail-dataPartitionId and the value is the user's parent groups. This lookup is the core of the ListGroup API, which is the most called API in the Entitlements service because users and services often check which groups they belong to. The getFromPartitionCache() method is also called in our authorization filter, AuthorizationServiceEntitlements#isAuthorized(), to check whether a user has permission to call a given API. Caching this response improves overall latency across our APIs, and for the ListGroup API in particular.
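As a rough illustration of the lookup described above, here is a minimal cache-aside sketch. It is not the code in this change: an in-memory map stands in for Redis, the `loader` function stands in for the Cosmos DB query, and the names `ParentGroupCacheSketch`, `key`, and `getParentGroups` are hypothetical. Only the key format (user email, a hyphen, then the data partition id) is taken from the description.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiFunction;

// Illustrative sketch only. The map stands in for Redis and the loader stands
// in for the Cosmos DB query; all class and method names are hypothetical.
class ParentGroupCacheSketch {
    private final Map<String, List<String>> store = new ConcurrentHashMap<>();
    private final BiFunction<String, String, List<String>> loader;

    ParentGroupCacheSketch(BiFunction<String, String, List<String>> loader) {
        this.loader = loader;
    }

    // Builds the cache key in the userEmail-dataPartitionId form.
    static String key(String userEmail, String dataPartitionId) {
        return userEmail + "-" + dataPartitionId;
    }

    // Cache-aside read: return the cached parent groups if present,
    // otherwise call the loader once and remember its result.
    List<String> getParentGroups(String userEmail, String dataPartitionId) {
        String cacheKey = key(userEmail, dataPartitionId);
        List<String> cached = store.get(cacheKey);
        if (cached != null) {
            return cached; // cache hit: no database call
        }
        List<String> loaded = loader.apply(userEmail, dataPartitionId);
        if (loaded != null) {
            store.put(cacheKey, loaded); // cache miss: populate for later calls
        }
        return loaded;
    }
}
```

With this shape, repeated calls for the same user and partition reach the database only on the first call, which is where the RU and latency savings would come from.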

The TTL of cache entries is 1 second. We chose 1 second because this is a basic cache implementation with no invalidation on writes. We need to be careful not to cache the value for too long: a write operation may change a user's groups, and until the entry expires the ListGroup API would return the cached response, which does not reflect the new groups. A TTL of 1 second keeps this staleness window short. Next season we plan to implement a more advanced on-demand caching strategy in which the cache is flushed and recreated on write operations (POST/DELETE Group API). The TTL will be longer in that implementation.
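To make the staleness window concrete, the sketch below models a 1-second TTL in memory. It is an assumption-laden illustration rather than the actual implementation: in Redis the expiry would be set on the key itself (for example with the SETEX or EXPIRE commands), whereas here `TtlCacheSketch` is a hypothetical class that checks expiry on read, with an injectable clock so the behavior can be tested without sleeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Illustrative sketch only: an in-memory stand-in for a Redis key with a TTL.
// Class and method names are hypothetical.
class TtlCacheSketch<V> {
    private static final class Entry<T> {
        final T value;
        final long expiresAtMillis;

        Entry(T value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // returns the current time in milliseconds

    TtlCacheSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    // Stores the value and records when it stops being valid.
    void put(String key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong() + ttlMillis));
    }

    // Returns the value while it is still fresh; returns null once the TTL
    // has elapsed, so the caller falls back to the database.
    V get(String key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() >= entry.expiresAtMillis) {
            entries.remove(key, entry); // drop the expired entry
            return null;
        }
        return entry.value;
    }
}
```

Constructed as `new TtlCacheSketch<>(1000, System::currentTimeMillis)`, an entry is served for at most one second after it is written, so a change to a user's groups becomes visible within that window.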

We also emit custom metrics that count cache hits and misses, for analysis purposes.
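The description does not say which metrics library or metric names are used, so the following is only a sketch of the idea: two counters, incremented on the hit and miss paths of the cache read, from which a hit ratio can be derived. `CacheMetricsSketch` and its methods are hypothetical; a real implementation would forward these counts to whatever telemetry backend the service uses.

```java
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: thread-safe hit/miss counters for a cache.
// Class and method names are hypothetical.
class CacheMetricsSketch {
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    void recordHit() {
        hits.increment();
    }

    void recordMiss() {
        misses.increment();
    }

    long hits() {
        return hits.sum();
    }

    long misses() {
        return misses.sum();
    }

    // Fraction of lookups served from the cache; 0.0 when nothing was recorded.
    double hitRatio() {
        long hitCount = hits.sum();
        long total = hitCount + misses.sum();
        return total == 0 ? 0.0 : (double) hitCount / total;
    }
}
```

A low hit ratio under load would suggest that the 1-second TTL expires entries before they are reused, which is useful input when sizing the longer TTL planned for the next iteration.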

Edited Mar 18, 2021 by Tika Lestari [SLB]