Partition service's (azure-provider) latency is more than 300 seconds
The Partition API (azure-provider) shows latencies of more than 300 seconds.
An inspection showed that there is a 2-minute timeout for Azure TableStorage, which may be the cause of the latencies.
A 10-minute latency was reproduced locally under the following conditions:
- Endpoints GET /api/partition/v1/partitions or /api/partition/v1/partitions/{partitionId}
- No data in cache.
- Azure Table storage is unavailable or responding too slowly.
- Many requests to API (more than 500).
Presumably, if the cache becomes outdated under high load, many simultaneous requests are sent to TableStorage. Every request that arrives before the first TableStorage response has been cached creates its own request to TableStorage and waits up to 2 minutes for a response. As a result, the API latency grows.
The solution is to use a cluster lock during the request to TableStorage. It's a copy of this solution from the Entitlements repository: https://community.opengroup.org/osdu/platform/security-and-compliance/entitlements/-/blob/master/provider/entitlements-v2-azure/src/main/java/org/opengroup/osdu/entitlements/v2/azure/service/GroupCacheServiceAzure.java#L81
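The idea can be sketched in plain Java as follows. This is a minimal illustration, not the code from this merge request: the class `LockedCacheSketch`, the method `countStoreCalls`, the 10-second lock wait, and the partition id are hypothetical, and an in-process `ReentrantLock` stands in for the cluster lock. A request that misses the cache takes the lock, re-checks the cache, and only then calls the slow store, so concurrent requests wait for one load instead of each starting their own.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Function;

public class LockedCacheSketch {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumption: a local lock stands in for the cluster lock.
    private final Lock lock = new ReentrantLock();
    // Assumption: this function stands in for the TableStorage lookup.
    private final Function<String, String> slowStore;

    public LockedCacheSketch(Function<String, String> slowStore) {
        this.slowStore = slowStore;
    }

    public String getPartition(String partitionId) {
        String cached = cache.get(partitionId);
        if (cached != null) {
            return cached; // fast path: no lock needed on a cache hit
        }
        boolean locked;
        try {
            locked = lock.tryLock(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for the lock", e);
        }
        if (!locked) {
            throw new IllegalStateException("Could not acquire the lock in time");
        }
        try {
            // Re-check: another request may have filled the cache while we waited.
            cached = cache.get(partitionId);
            if (cached != null) {
                return cached;
            }
            String loaded = slowStore.apply(partitionId);
            cache.put(partitionId, loaded);
            return loaded;
        } finally {
            lock.unlock();
        }
    }

    /** Fires the given number of concurrent lookups and returns how many reached the store. */
    public static int countStoreCalls(int requests) {
        AtomicInteger storeCalls = new AtomicInteger();
        LockedCacheSketch service = new LockedCacheSketch(id -> {
            storeCalls.incrementAndGet();
            try {
                Thread.sleep(200); // simulate a slow storage response
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "properties-of-" + id;
        });
        ExecutorService pool = Executors.newFixedThreadPool(requests);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                start.await(); // release all requests at the same moment
                return service.getPartition("some-partition");
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return storeCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("Store calls: " + countStoreCalls(50));
    }
}
```

In this sketch, 50 simultaneous requests result in a single call to the simulated store. Without the lock and the second cache check, each of them could start its own slow lookup.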
@Qualifier("cachedPartitionServiceImpl") was removed to make the bean "CachedPartitionServiceImpl" overridable. CachedPartitionServiceImpl (defined in partition-core) is now overridden by ProviderCachedPartitionServiceImpl (defined in partition-azure). The CachedPartitionService interface was introduced to resolve the ambiguity between the beans CachedPartitionService and PartitionServiceImpl, both of which inherit from IPartitionService. Injection points now use the CachedPartitionService type to resolve the ambiguity instead of @Qualifier("cachedPartitionServiceImpl").
The new code was tested under the same conditions, and the latency did not grow.