Very high number of 429s on CosmosDb when there is a usage spike in the Storage `query/records:batch` API
In one of our client environments, we consistently see a very high number of 429 errors from CosmosDb. This causes latency spikes for the Storage APIs.
Our investigation suggests this is related to a performance/optimization issue in the `query/records:batch` API. Across multiple time windows, spikes in `query/records:batch` calls coincide directly with spikes in CosmosDb 429 errors. Please see the attached images for reference.
The first image shows a time window in which CosmosDb returned a large number of 429 errors. The second image shows the Storage API usage pattern. Most of the API calls are made to the `query/records:batch` API, which also affects the latency numbers. The patterns in both images are very similar.
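
The visual similarity between the two charts could also be checked numerically. Below is a minimal sketch, assuming per-window counts of `query/records:batch` calls and of CosmosDb 429 responses have been exported from monitoring as two equally sized series. The class name `SpikeCorrelation` and both parameter names are hypothetical, not part of the Storage service.

```java
/** Sketch: Pearson correlation between two per-window count series. */
final class SpikeCorrelation {

    private SpikeCorrelation() {}

    /**
     * @param apiCalls  per-window counts of query/records:batch calls (hypothetical input)
     * @param throttled per-window counts of CosmosDb 429 responses (hypothetical input)
     * @return Pearson coefficient in [-1, 1]; values near 1 mean the spikes move together
     */
    static double pearson(double[] apiCalls, double[] throttled) {
        if (apiCalls.length != throttled.length || apiCalls.length < 2) {
            throw new IllegalArgumentException("series must have the same length (at least 2)");
        }
        int n = apiCalls.length;

        // Mean of each series.
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += apiCalls[i];
            meanY += throttled[i];
        }
        meanX /= n;
        meanY /= n;

        // Sums of products of deviations from the means.
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = apiCalls[i] - meanX;
            double dy = throttled[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        if (varX == 0.0 || varY == 0.0) {
            throw new IllegalArgumentException("a constant series has no defined correlation");
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

A coefficient close to 1 over several incident windows would support the correlation seen in the images. It would not prove causation on its own.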
We have tried increasing the RUs on CosmosDb during multiple incidents, but that did not help.
Further load tests showed that query/records:batch can be a root cause of the 429 errors.
As part of fixing this issue, it would be reasonable to implement some of the recommendations from https://docs.microsoft.com/en-us/azure/cosmos-db/sql/performance-tips-query-sdk?tabs=v3&pivots=programming-language-java
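
One direction that may be worth evaluating is scoping each query to a single partition instead of fanning out across partitions. The sketch below only shows the grouping step. It assumes a partition key can be derived from each record id, which this issue does not confirm. `BatchQueryPlanner` and `partitionKeyOf` are hypothetical names, not existing Storage service code.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Sketch: split one batch of record ids into per-partition groups. */
final class BatchQueryPlanner {

    private BatchQueryPlanner() {}

    /**
     * Groups record ids by partition key so that each group can be fetched
     * with a query scoped to one partition.
     *
     * @param recordIds      ids requested in one batch call
     * @param partitionKeyOf hypothetical mapping from record id to partition key
     * @return ids grouped by partition key, in first-seen order
     */
    static Map<String, List<String>> groupByPartition(
            List<String> recordIds, Function<String, String> partitionKeyOf) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String id : recordIds) {
            String key = partitionKeyOf.apply(id);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(id);
        }
        return groups;
    }
}
```

With the Java SDK v4, each group could then be queried with `CosmosQueryRequestOptions.setPartitionKey(...)` set to the group's key. Whether this reduces the 429s here depends on how the records container is actually partitioned and would need to be confirmed with load tests.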