Implement long-lived Redis cache
Implement cache aside pattern
We want to extend Entitlements to use a cache-aside pattern on all read operations.
The concern is that Cosmos throughput is limited to 10,000 RUs per logical partition (which matches a real partition in OSDU). We know that for users with a large number of groups assigned, this could greatly limit how many requests can read from Cosmos concurrently (perhaps as few as 5 at the same time).
We need to make sure the cache hit rate is high and that we use Cosmos as little as possible. The cache-aside pattern will allow for this, but we also need to configure it for our needs.
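A minimal sketch of the cache-aside read path may help make the intent concrete. Everything named here is an assumption for illustration: `InMemoryCache` is a stand-in for Redis (its `get`/`setex` mirror the redis-py method shapes), `load_from_cosmos` represents whatever query Entitlements runs today, and the `groups:<member>` key format is hypothetical.

```python
import json
import time
from typing import Callable, Dict, List, Optional, Tuple

STANDARD_TTL_SECONDS = 600  # the 10 minute suggestion from this proposal


class InMemoryCache:
    """Hypothetical stand-in for Redis so the sketch runs without a server."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has outlived its TTL: treat it as a miss.
            del self._data[key]
            return None
        return value

    def setex(self, key: str, ttl_seconds: float, value: str) -> None:
        self._data[key] = (value, time.monotonic() + ttl_seconds)


def list_groups_cached(
    cache: InMemoryCache,
    member_id: str,
    load_from_cosmos: Callable[[str], List[str]],
    ttl_seconds: float = STANDARD_TTL_SECONDS,
) -> List[str]:
    key = f"groups:{member_id}"  # hypothetical key format
    cached = cache.get(key)
    if cached is not None:
        # Cache hit: Cosmos is not touched at all.
        return json.loads(cached)
    # Cache miss: read from Cosmos once, then populate the cache.
    groups = load_from_cosmos(member_id)
    cache.setex(key, ttl_seconds, json.dumps(groups))
    return groups
```

With this shape, only the first read per key within the TTL window costs RUs; every later read in that window is served from the cache.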
Standard TTL
Suggestion: 10 minutes
One concern we have is that the operation to find all keys to expire after a write operation could fail. Cosmos Graph does not support transactions, so we have to do these 2 operations as separate, sequential steps, meaning the update operation could succeed but the read could fail.
In this case the Redis cache will be stale. We can set a TTL that enables a high cache hit rate (60 seconds gave us >90%) but that also eventually refreshes the cache in case of this failure.
We should do the read operation before the write operation to mitigate this.
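The ordering can be sketched as follows. This is only an illustration under stated assumptions: `find_affected_keys`, `apply_write` and `expire_key` are hypothetical callables standing in for the Cosmos read, the Cosmos update and the Redis expiry, and it assumes the affected keys can be determined from the pre-write state.

```python
from typing import Callable, Iterable, List


def write_with_invalidation(
    find_affected_keys: Callable[[], Iterable[str]],
    apply_write: Callable[[], None],
    expire_key: Callable[[str], None],
) -> List[str]:
    # 1. Read first. If this raises, nothing has been written yet,
    #    so the cache and Cosmos still agree.
    keys = list(find_affected_keys())
    # 2. Only now perform the update.
    apply_write()
    # 3. Expire the keys collected in step 1. If this step fails,
    #    the standard TTL is the fallback that eventually refreshes them.
    for key in keys:
        expire_key(key)
    return keys
```

If the read fails, the caller sees the error before any data has changed and can retry the whole operation.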
Expire TTL
Suggestion: 1 second
Suggestion: Max 1 second jitter
When we expire a key because of a write operation, it would normally be set to expire immediately. However, we have already proven that an eventual consistency of 1 second does not affect the user experience, as the initial cache simply expired after 1 second.
We can therefore exploit this by setting the expiration to 1 second after a write operation. This batches concurrent write operations within that time, meaning we don't thrash refreshing the cache as often during periods of concurrent writes.
If the TTL on a key has already been changed from the default value by a different concurrent write operation, it should not be reset; the key should expire 1 second after the first update.
We also have a configurable max jitter which we add onto the expire time to help spread out when keys are refreshed.
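One possible way to express the expiry rule is sketched below. It is an assumption-laden illustration, not the service's implementation: the cache is modelled as a plain dict of key to absolute expiry time, `schedule_expiry` is a hypothetical helper, and "already shortened" is detected by checking whether the remaining TTL is within the expire window (expire TTL plus max jitter). Against real Redis the check and the update would need to happen atomically.

```python
import random
from typing import Callable, Dict

EXPIRE_TTL_SECONDS = 1.0   # suggested expire TTL
MAX_JITTER_SECONDS = 1.0   # suggested max jitter


def schedule_expiry(
    expires_at: Dict[str, float],
    key: str,
    now: float,
    expire_ttl: float = EXPIRE_TTL_SECONDS,
    max_jitter: float = MAX_JITTER_SECONDS,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Shorten the key's TTL after a write. Returns True if the TTL changed."""
    current = expires_at.get(key)
    if current is None or current <= now:
        # Key is not cached (or already expired): nothing to do.
        return False
    if current - now <= expire_ttl + max_jitter:
        # An earlier write already shortened this key (or it is about to
        # expire anyway): keep the first deadline, do not reset it.
        return False
    # First write in this window: expire shortly, spread out by jitter.
    expires_at[key] = now + expire_ttl + rng() * max_jitter
    return True
```

Because later writes inside the window leave the deadline untouched, a burst of concurrent writes results in a single cache refresh rather than one per write.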
Mitigations
Entitlements is the core service in OSDU. Every service depends on it. If list groups is not functioning, nothing in the system is.
There is potential that under load the system could still return high error rates due to Cosmos being overwhelmed. We could make the TTL configurable per partition (via the partition service).
For instance, if we set the default values to 10 minutes and 1 second, then in case of high error rates from Entitlements a mitigation could be to increase the TTLs to higher values, e.g. 60 minutes and 15 seconds.
Although this means the eventual consistency window is longer, the vast majority of the system will behave correctly.
This means these need to be configurable values we can set per partition in the partition service to allow for dynamic configuration. We should have base default values in the service which the partition info can override.
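As a rough sketch of the default-plus-override idea, the resolution could look like the following. The property names and the `CacheTtlConfig` / `resolve_ttl_config` names are hypothetical; the actual keys exposed by the partition service would need to be agreed. Invalid or missing overrides fall back to the service defaults.

```python
from dataclasses import dataclass, replace
from typing import Dict, Mapping, Tuple


@dataclass(frozen=True)
class CacheTtlConfig:
    # Base defaults held in the service (values suggested in this proposal).
    standard_ttl_seconds: int = 600
    expire_ttl_seconds: int = 1
    max_jitter_seconds: int = 1


# Hypothetical partition property names, mapped to (field, minimum allowed value).
PARTITION_PROPERTIES: Dict[str, Tuple[str, int]] = {
    "entitlements-cache-standard-ttl-seconds": ("standard_ttl_seconds", 1),
    "entitlements-cache-expire-ttl-seconds": ("expire_ttl_seconds", 1),
    "entitlements-cache-max-jitter-seconds": ("max_jitter_seconds", 0),
}


def resolve_ttl_config(
    partition_properties: Mapping[str, object],
    defaults: CacheTtlConfig = CacheTtlConfig(),
) -> CacheTtlConfig:
    overrides: Dict[str, int] = {}
    for prop_name, (field_name, minimum) in PARTITION_PROPERTIES.items():
        raw = partition_properties.get(prop_name)
        if raw is None:
            continue  # no override for this partition: keep the default
        try:
            value = int(raw)
        except (TypeError, ValueError):
            continue  # ignore malformed overrides rather than break reads
        if value >= minimum:
            overrides[field_name] = value
    return replace(defaults, **overrides)
```

For the mitigation described above, a partition under load would set the standard TTL property to 3600 and the expire TTL property to 15, while all other partitions keep the defaults.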
The values for the TTL can be discussed and tested over time.