There are latencies (more than 300 seconds) on Partition API. (!87)

Dmitrii Gerashchenko requested to merge trusted-azure-tablestorage-synchronous-cache-update into master Aug 27, 2021

An inspection showed that Azure TableStorage has a 2-minute timeout, which may be the cause of the latencies.

A 10-minute latency was reproduced locally under the following conditions:

The endpoint is GET /api/partition/v1/partitions or GET /api/partition/v1/partitions/{partitionId}.
There is no data in the cache.
Azure TableStorage is unavailable or responding too slowly.
There are many requests to the API (more than 500).

Presumably, if the cache becomes outdated under high load, many simultaneous requests are sent to TableStorage. Every API request that arrives before the first TableStorage response has been cached creates its own request to TableStorage and waits up to 2 minutes for a response. As a result, the API latency grows.

The solution is to hold a cluster lock for the duration of the request to TableStorage. It is a copy of this solution from the Entitlements repository: https://community.opengroup.org/osdu/platform/security-and-compliance/entitlements/-/blob/master/provider/entitlements-v2-azure/src/main/java/org/opengroup/osdu/entitlements/v2/azure/service/GroupCacheServiceAzure.java#L81
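The idea of locking around the TableStorage request can be illustrated with a minimal sketch. It is not the code from this merge request or from the Entitlements repository: the class name `LockedCacheLoader`, the `slowStore` function, and the in-process `ReentrantLock` (standing in for the cluster-wide lock) are all assumptions made for illustration. The cache is re-checked after the lock is acquired, so requests that queued behind the first one read the cached value instead of calling the store again.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

// Hypothetical illustration; names do not come from the Partition code base.
class LockedCacheLoader {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumption: an in-process lock stands in for the cluster lock.
    private final Lock lock = new ReentrantLock();
    // Assumption: stands in for the slow TableStorage lookup.
    private final Function<String, String> slowStore;

    LockedCacheLoader(Function<String, String> slowStore) {
        this.slowStore = slowStore;
    }

    String get(String partitionId) {
        // Fast path: serve from the cache without locking.
        String cached = cache.get(partitionId);
        if (cached != null) {
            return cached;
        }
        lock.lock();
        try {
            // Re-check: another caller may have filled the cache
            // while this one was waiting for the lock.
            cached = cache.get(partitionId);
            if (cached != null) {
                return cached;
            }
            // Only one caller at a time reaches the slow store.
            String loaded = slowStore.apply(partitionId);
            cache.put(partitionId, loaded);
            return loaded;
        } finally {
            lock.unlock();
        }
    }
}
```

With this shape, a burst of concurrent requests for an uncached partition results in a single call to the store. The remaining requests wait on the lock and then hit the cache.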

@Qualifier("cachedPartitionServiceImpl") was removed to make the bean "CachedPartitionServiceImpl" overridable. CachedPartitionServiceImpl (defined in partition-core) is now overridden by ProviderCachedPartitionServiceImpl (defined in partition-azure). The CachedPartitionService interface was introduced to resolve the ambiguity between the beans CachedPartitionService and PartitionServiceImpl, both of which inherit IPartitionService. CachedPartitionService now resolves that ambiguity in place of @Qualifier("cachedPartitionServiceImpl").
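Why a narrower interface removes the need for a qualifier can be shown with a plain-Java sketch. It uses no Spring: the hypothetical `ByTypeLookup.candidates` helper only imitates a lookup of beans by type, and the method `getPartition` and the exact shape of the hierarchy are assumptions based on the description above.

```java
import java.util.List;
import java.util.stream.Collectors;

// Assumed shape of the hierarchy; the method signature is hypothetical.
interface IPartitionService {
    String getPartition(String partitionId);
}

// Narrower interface that only the caching service implements.
interface CachedPartitionService extends IPartitionService {
}

class PartitionServiceImpl implements IPartitionService {
    @Override
    public String getPartition(String partitionId) {
        return "store:" + partitionId;
    }
}

class CachedPartitionServiceImpl implements CachedPartitionService {
    private final IPartitionService delegate;

    CachedPartitionServiceImpl(IPartitionService delegate) {
        this.delegate = delegate;
    }

    @Override
    public String getPartition(String partitionId) {
        return "cache->" + delegate.getPartition(partitionId);
    }
}

// Hypothetical helper imitating a by-type bean lookup.
class ByTypeLookup {
    static <T> List<T> candidates(List<Object> beans, Class<T> type) {
        return beans.stream()
                .filter(type::isInstance)
                .map(type::cast)
                .collect(Collectors.toList());
    }
}
```

A lookup by `IPartitionService` matches both objects, which is the ambiguity the qualifier used to resolve. A lookup by `CachedPartitionService` matches only the caching one, so an injection point declared with that type needs no qualifier.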

The new code was tested under the same conditions, and the latency did not grow.
