
Draft: Reset Cosmos connections if Gone Exception is thrown

Krishna Nikhil Vedurumudi requested to merge krveduru/goneexp into master

All Submissions:

  • [YES/NO] I have added an explanation of what the changes in this merge do and why we should include them.
  • [YES/NO] I have updated the documentation accordingly.
  • [YES/NO/NA] I have added tests to cover my changes.
  • [YES/NO/NA] All new and existing tests passed.
  • [YES/NO/NA] My code follows the code style of this project.
  • [YES/NO/NA] I ran lint checks locally prior to submission.

What is the issue or story related to the change?

The Cosmos SDK throws a GoneException when the current connection becomes unusable for any reason, for example when a connection goes stale after Cosmos re-partitioning or when the service responds with Gone (HTTP status 410).

Our services then stay non-functional for a while because the cached connections are not refreshed soon enough.

This change adds handling for GoneException that resets the connection cache when the exception occurs, so that subsequent requests use a new connection, thereby significantly reducing the blast radius.

High level design:


Currently, when CosmosStore handles a GoneException, it only returns a 500.
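
The intended handling (reset the cached connection on GoneException so the next request reconnects) could look roughly like the sketch below. It is illustrative only and is not the code in this MR: `GoneException` here is a local stand-in for the SDK exception, and `client_factory`, `_get_client`, `_reset_client`, and `read` are hypothetical names. The real CosmosStore and the Cosmos SDK client types are not shown in this description.

```python
import threading


class GoneException(Exception):
    """Stand-in for the SDK's Gone (HTTP 410) error; not the real SDK type."""


class CosmosStore:
    """Hypothetical store that caches one client and drops it after a Gone error."""

    def __init__(self, client_factory):
        # client_factory is an assumed callable that builds a fresh client.
        self._client_factory = client_factory
        self._lock = threading.Lock()
        self._client = None

    def _get_client(self):
        # Lazily create the client and reuse it until it is reset.
        with self._lock:
            if self._client is None:
                self._client = self._client_factory()
            return self._client

    def _reset_client(self, failed_client):
        # Drop the cached client only if it is still the one that failed,
        # so concurrent failures do not discard a fresh replacement.
        with self._lock:
            if self._client is failed_client:
                self._client = None

    def read(self, key):
        client = self._get_client()
        try:
            return 200, client.read(key)
        except GoneException:
            # The failing request still gets a 500, as it does today,
            # but the next request will build a new client.
            self._reset_client(client)
            return 500, None
```

In this sketch only the request that hits the GoneException fails; whether the real change should also retry that request, or dispose of the old client explicitly, is left open here.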

Test coverage:

Does this introduce a breaking change?

  • [YES/NO]

Pending items

Reviewer request

  • Please provide an ETA when you plan to review this MR. Write a comment to decline or provide an ETA.
  • Block the MR if you feel the testing is insufficient or the MR lacks details.
  • Please cover the following aspects in the MR:
      -- Coding design: <Reviewer1>
      -- Backward Compatibility: <Reviewer2>
      -- Feature Logic: <Logic design>
      -- <Any other context mention here>
    OR
      -- <Component 1>: <Reviewer1>
      -- <CosmosDB>: <Reviewer2>
      -- <ServiceBus>: <Reviewer3>
      -- <Mention any other component and owner>

Other information
