[BUG] Cache rebuild is not thread safe
We currently implement a simple short-lived cache. If there are hotspot keys in the system (keys that receive very frequent requests), we may hit a cache breakdown problem: when a hotspot key is flushed and many requests arrive at the same time, a large number of threads try to rebuild the cache concurrently, which increases the backend load.
The common solutions for this problem are:
- Never expire a key (ttl=0). This gives the best read performance, but it introduces eventual consistency if we cannot synchronously rebuild all impacted cache entries when a write operation happens.
- Use a mutex key. Only one thread (the one that acquires the lock for the cache key) is allowed to rebuild the cache.
To avoid introducing eventual consistency (a breaking change) in the OSDU Entitlements v2 service, we select the second option. We run multiple pods of the service, so we need a distributed lock solution. We should avoid implementing our own locking algorithm, to prevent deadlock scenarios, since creating a lock and setting its expiration are usually two separate commands and are not atomic by default.
Since we already use Redis, we can use the Redlock algorithm for the distributed lock, and Redisson is the suggested Java library that implements this algorithm.
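For context, a minimal sketch of wiring Redisson up and obtaining a per-key lock. The Redis address and the lock key prefix are placeholders; the real service would reuse its existing Redis configuration.

```java
import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RedissonLockSetup {
    public static void main(String[] args) {
        // Placeholder single-node Redis address; swap in the service's existing Redis settings.
        Config config = new Config();
        config.useSingleServer().setAddress("redis://127.0.0.1:6379");
        RedissonClient redisson = Redisson.create(config);

        // One lock per cache key, so rebuilds of different keys do not block each other.
        RLock lock = redisson.getLock("cache-rebuild-lock:someCacheKey");
        // ... tryLock / unlock as in the pseudocode below ...

        redisson.shutdown();
    }
}
```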
Pseudocode
Get the cached value
If cache hit, return the value
If cache miss:
    if Redisson lock.tryLock(TIMEOUT, LOCK_EXPIRATION_TIME) acquires the lock:
        try {
            rebuild the cache and return the value
        } finally {
            release the lock
        }
    else:
        back off and retry with a constant sleep time, up to MAX_TIME
        return the value as soon as the cache hits
        return empty if MAX_TIME is reached
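A rough Java sketch of the flow above, assuming hypothetical `Cache` and `Backend` abstractions; only the Redisson calls (`getLock`, `tryLock`, `isHeldByCurrentThread`, `unlock`) are the real library API. The constants use the values suggested in the Parameters section, with TIMEOUT assumed at 100 ms.

```java
import java.util.Optional;
import java.util.concurrent.TimeUnit;

import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;

public class CacheRebuildService {

    // Values from the Parameters section; TIMEOUT is an assumed 100 ms placeholder.
    private static final long TIMEOUT_MS = 100;             // lock wait time
    private static final long LOCK_EXPIRATION_MS = 5_000;   // lock lease time
    private static final long MAX_TIME_MS = 3_000;          // back-off budget
    private static final long RETRY_INTERVAL_MS = 200;      // sleep between cache re-checks

    private final RedissonClient redisson;
    private final Cache cache;     // hypothetical short-lived cache abstraction
    private final Backend backend; // hypothetical backend loader

    public CacheRebuildService(RedissonClient redisson, Cache cache, Backend backend) {
        this.redisson = redisson;
        this.cache = cache;
        this.backend = backend;
    }

    public Optional<String> get(String key) throws InterruptedException {
        Optional<String> cached = cache.get(key);
        if (cached.isPresent()) {
            return cached; // cache hit
        }
        RLock lock = redisson.getLock("cache-rebuild-lock:" + key);
        if (lock.tryLock(TIMEOUT_MS, LOCK_EXPIRATION_MS, TimeUnit.MILLISECONDS)) {
            try {
                // Only the lock holder rebuilds the cache.
                String value = backend.load(key);
                cache.put(key, value);
                return Optional.of(value);
            } finally {
                // Guard against the lease having already expired before unlocking.
                if (lock.isHeldByCurrentThread()) {
                    lock.unlock();
                }
            }
        }
        // Lock not acquired: back off and wait for the lock holder to populate the cache.
        long deadline = System.currentTimeMillis() + MAX_TIME_MS;
        while (System.currentTimeMillis() < deadline) {
            Thread.sleep(RETRY_INTERVAL_MS);
            cached = cache.get(key);
            if (cached.isPresent()) {
                return cached; // value rebuilt by the lock holder
            }
        }
        return Optional.empty(); // timed out waiting for the rebuild
    }

    // Hypothetical interfaces, included only to keep the sketch self-contained.
    public interface Cache {
        Optional<String> get(String key);
        void put(String key, String value);
    }

    public interface Backend {
        String load(String key);
    }
}
```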
Parameters
TIMEOUT: the Redisson lock wait time (how long tryLock waits to acquire the lock). We should set this as small as possible; in theory, if TIMEOUT is much less than the cache rebuild cost, it guarantees that only one thread can acquire the lock during a rebuild.
LOCK_EXPIRATION_TIME: the Redisson lock expiration (lease time) configuration, which prevents a deadlock if an exception happens between lock acquisition and lock release. Ideally it should be slightly larger than the cache rebuild latency. We can set it to 5s for now.
MAX_TIME: the maximum back-off retry waiting time. This gives the other threads time to wait while the lock-holding thread rebuilds the cache, and then read the rebuilt value. It needs to be larger than the cache rebuild latency. We can set it to 3 seconds with a retry every 200ms, depending on the cache rebuild latency (the less frequent the retry, the more latency it may add overall, but the less traffic it sends to the Redis instance).
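For reference, a sketch of how these parameters map onto Redisson's tryLock arguments. The 100 ms TIMEOUT is an assumption, since this section only says it should be as small as possible; the 5 s, 3 s, and 200 ms values come from above.

```java
import java.util.concurrent.TimeUnit;

import org.redisson.api.RLock;

public final class LockParameters {
    public static final long TIMEOUT_MS = 100;                 // assumed placeholder; "as small as possible"
    public static final long LOCK_EXPIRATION_TIME_MS = 5_000;  // slightly above cache rebuild latency
    public static final long MAX_TIME_MS = 3_000;              // back-off budget for threads that lost the lock
    public static final long RETRY_INTERVAL_MS = 200;          // sleep between cache re-checks

    // TIMEOUT maps to Redisson's waitTime, LOCK_EXPIRATION_TIME to its leaseTime.
    public static boolean tryAcquire(RLock lock) throws InterruptedException {
        return lock.tryLock(TIMEOUT_MS, LOCK_EXPIRATION_TIME_MS, TimeUnit.MILLISECONDS);
    }

    private LockParameters() {
    }
}
```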