Draft: Reset Cosmos connections when a GoneException is thrown
- [YES/NO] I have added an explanation of what the changes in this merge request do and why we should include them.
- [YES/NO] I have updated the documentation accordingly.
- [YES/NO/NA] I have added tests to cover my changes.
- [YES/NO/NA] All new and existing tests passed.
- [YES/NO/NA] My code follows the code style of this project.
- [YES/NO/NA] I ran lint checks locally prior to submission.
What is the issue or story related to the change?
The Cosmos SDK throws a GoneException when the current connection is unusable for any reason, whether the connection has gone stale after Cosmos re-partitioning or the service returns Gone (HTTP status 410).
This leaves our services non-functional for a while, because the cached connections are not refreshed soon enough.
This change adds handling for GoneException that resets the cache when these exceptions occur, so that subsequent requests use a new connection, thereby significantly reducing the blast radius.
High level design:
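Currently, when CosmosStore handles a GoneException, it only returns a 500. With this change, it also resets the cached connections so that subsequent requests use a new connection.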
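
A minimal sketch of the intended handling, for illustration only. It does not use the real Cosmos SDK: `GoneException`, `Connection`, and `CosmosStoreSketch` below are hypothetical stand-ins, and the actual CosmosStore code and SDK types may look different. The assumption is that the store holds one cached connection, drops it when a GoneException is caught, and lazily creates a new one on the next request.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the SDK's GoneException (HTTP 410); not the real SDK class.
class GoneException extends RuntimeException {
    GoneException(String message) {
        super(message);
    }
}

// Hypothetical abstraction over a cached Cosmos connection/client.
interface Connection {
    String read(String id);
}

// Hypothetical sketch of a store that resets its cached connection on GoneException.
class CosmosStoreSketch {
    private final Supplier<Connection> factory;
    private Connection cached;

    CosmosStoreSketch(Supplier<Connection> factory) {
        this.factory = factory;
    }

    // Returns the cached connection, creating it lazily if the cache is empty.
    private synchronized Connection connection() {
        if (cached == null) {
            cached = factory.get();
        }
        return cached;
    }

    // Drops the cached connection, but only if it is still the one that failed,
    // so a concurrent request that already refreshed the cache is not undone.
    private synchronized void reset(Connection failed) {
        if (cached == failed) {
            cached = null;
        }
    }

    // Returns an HTTP-like status code: 200 on success, 500 on GoneException.
    int read(String id) {
        Connection conn = connection();
        try {
            conn.read(id);
            return 200;
        } catch (GoneException e) {
            // Existing behaviour: report a 500 for this request.
            // New behaviour: also reset the cache so the next request reconnects.
            reset(conn);
            return 500;
        }
    }
}
```

In this sketch the failing request still returns a 500; only subsequent requests benefit from the new connection. Retrying the failed request on the new connection would be a separate design decision.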
Does this introduce a breaking change?
- Please provide an ETA when you plan to review this MR. Write a comment to decline or provide an ETA.
- Block the MR if you feel the testing is insufficient or the MR lacks details.
- Please cover the following aspects in the MR:
  - Coding design: <Reviewer1>
  - Backward Compatibility: <Reviewer2>
  - Feature Logic: <Logic design>
  - <Any other context mention here>

  OR

  - <Component 1>: <Reviewer1>
  - <CosmosDB>: <Reviewer2>
  - <ServiceBus>: <Reviewer3>
  - <Mention any other component and owner>