
fix some bottleneck found during perf test failure investigation

Yannick requested to merge fix_perf into master

Fixes several bottlenecks found while investigating the perf test failures. The fixes are:

In the Azure lib (not in this MR, see the corresponding MR):

  • Secret value pinned to partition info (cached for 5 minutes). Reason: the secret was fetched synchronously on every request, which left the service unresponsive for about 90 ms on average.
  • Add an asyncio lock after a cache miss. Reason: this prevents unnecessary calls to fetch partition info, which otherwise lead to cache resets when several requests run concurrently.
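
The lock-after-miss idea can be sketched as follows. This is only an illustration, not the Azure lib's actual code: `PartitionInfoCache`, `fake_fetch` and `demo` are hypothetical names, and the 300-second TTL simply mirrors the 5 minutes quoted above. The cache is checked once without the lock, then again after acquiring it, so that concurrent requests hitting a cold cache trigger a single fetch.

```python
import asyncio
import time

CACHE_TTL_SECONDS = 300  # 5 minutes, as quoted in the description


class PartitionInfoCache:
    """Hypothetical cache, used for illustration only."""

    def __init__(self, fetch, ttl=CACHE_TTL_SECONDS):
        self._fetch = fetch      # async callable: partition_id -> info
        self._ttl = ttl
        self._entries = {}       # partition_id -> (expiry, info)
        self._lock = asyncio.Lock()

    def _lookup(self, partition_id):
        entry = self._entries.get(partition_id)
        if entry is not None and entry[0] > time.monotonic():
            return entry[1]
        return None

    async def get(self, partition_id):
        # Fast path: no lock needed when the entry is still fresh.
        info = self._lookup(partition_id)
        if info is not None:
            return info
        async with self._lock:
            # Re-check: another coroutine may have filled the cache
            # while this one was waiting for the lock.
            info = self._lookup(partition_id)
            if info is None:
                info = await self._fetch(partition_id)
                expiry = time.monotonic() + self._ttl
                self._entries[partition_id] = (expiry, info)
            return info


async def demo():
    calls = 0

    async def fake_fetch(partition_id):
        # Stand-in for the real (slow) partition info and secret lookup.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return {"partition": partition_id, "secret": "dummy"}

    cache = PartitionInfoCache(fake_fetch)
    results = await asyncio.gather(*(cache.get("p1") for _ in range(10)))
    return calls, results
```

Running `asyncio.run(demo())` issues ten concurrent lookups against an empty cache. Without the lock and the second check, each of them would call the fetch function. A single lock shared by all partitions is the simplest choice here; a per-partition lock would reduce contention at the cost of more bookkeeping.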

WDMS Service (this MR):

  • Extend backoff to the client_storage.ResponseHandlingException exception. Reason: errors such as "server disconnected" are caught and rethrown as ResponseHandlingException. Retrying in that case provides some resilience against these errors, which tend to occur under stress.

  • Bulk data (de)serialization done asynchronously. Reason: this appears to be the main bottleneck during the perf test. Because the task ran synchronously in the main thread, (de)serializing bulk data could take from a few tens of milliseconds to a few seconds, and the service was fully unresponsive during that time. The fix is to run these tasks in a process pool (preferred over a thread pool since the operation is CPU-bound). For now the pool size is arbitrarily set to 4 (subject to change later). There is also a bootstrap mechanism at service startup, so the pool is fully ready for the first incoming request.
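
A minimal sketch of the process pool approach, under stated assumptions: `serialize_bulk`, `deserialize_bulk`, `warm_up` and `run_in_pool` are hypothetical names, and JSON stands in for the real bulk format, which this description does not specify. Only the pool size of 4 comes from the text above.

```python
import asyncio
import json
from concurrent.futures import ProcessPoolExecutor

POOL_SIZE = 4  # value quoted in the description, subject to change


def serialize_bulk(rows):
    """CPU-bound stand-in for the real bulk serialization."""
    return json.dumps(rows).encode("utf-8")


def deserialize_bulk(payload):
    """CPU-bound stand-in for the real bulk deserialization."""
    return json.loads(payload.decode("utf-8"))


def _noop():
    """Trivial task used only to get worker processes started."""
    return None


async def warm_up(pool, size=POOL_SIZE):
    # Submit trivial tasks at startup so workers are started before real
    # traffic arrives. Depending on the Python version and start method,
    # the pool may start fewer than `size` workers at this point.
    loop = asyncio.get_running_loop()
    await asyncio.gather(
        *(loop.run_in_executor(pool, _noop) for _ in range(size))
    )


async def run_in_pool(pool, func, *args):
    # The event loop stays free to serve other requests while a worker
    # process does the CPU-bound work.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, func, *args)


async def main():
    with ProcessPoolExecutor(max_workers=POOL_SIZE) as pool:
        await warm_up(pool)
        rows = [{"depth": i, "value": i * 0.5} for i in range(1000)]
        payload = await run_in_pool(pool, serialize_bulk, rows)
        restored = await run_in_pool(pool, deserialize_bulk, payload)
        print(restored == rows)


if __name__ == "__main__":
    asyncio.run(main())
```

The functions sent to the pool are defined at module level because a process pool has to pickle them. Arguments and results are pickled as well, so very large payloads pay a copy cost when crossing the process boundary, which is worth measuring when tuning the pool size.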

Edited by Yannick
