OutOfMemoryError: Max Thread Limit Reached

Problem Description

The service was experiencing Out of Memory (OOM) errors due to thread leaks in the Elasticsearch client. The error logs showed:

[2040.742s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 2040k, guardsize: 0k, detached.
[2040.742s][warning][os,thread] Failed to start the native thread for java.lang.Thread "elasticsearch-rest-client-3659-thread-4"
Exception in thread "elasticsearch-rest-client-3659-thread-1" java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

The ps -lfT | wc -l command showed 18,723 threads, which is extremely high. Each thread reserves memory for its stack (typically 1 MB or more).
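A back-of-the-envelope sketch of why that thread count matters. The 1 MB-per-thread figure below is an assumption (a common JVM default; the log above actually shows a ~2 MB stacksize), so treat the result as an order-of-magnitude estimate:

```java
public class ThreadStackEstimate {

    // Assumed per-thread stack reservation; the real value is platform- and
    // -Xss-dependent (the log above reports a 2040k stacksize).
    static final long STACK_BYTES_PER_THREAD = 1L * 1024 * 1024;

    static long estimateStackBytes(int threadCount) {
        return threadCount * STACK_BYTES_PER_THREAD;
    }

    public static void main(String[] args) {
        long bytes = estimateStackBytes(18_723);
        // prints "~18.3 GiB reserved for thread stacks"
        System.out.printf("~%.1f GiB reserved for thread stacks%n",
                bytes / (1024.0 * 1024 * 1024));
    }
}
```

Even at the conservative 1 MB assumption, 18,723 threads reserve roughly 18 GiB of stack address space, so thread or memory limits are hit long before the heap fills.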

Root Cause Analysis

  1. Unbounded Client Creation Without Cleanup:
  • The getOrCreateRestClient() method creates a new Elasticsearch client if one doesn't exist in the cache for the current partition ID
  • However, there was no corresponding method to close these clients when they were no longer needed
  • Each client maintains its own connection pool and thread pool, which were never being cleaned up
  2. No Connection Pool Limits:
  • The HttpAsyncClientBuilder didn't have any limits on the number of connections it could create
  • Without setMaxConnTotal and setMaxConnPerRoute, the client would create an unlimited number of connections
  • Each connection creates threads, leading to thread exhaustion
  3. Client Reuse Without Resource Management:
  • The code cached clients by partition ID, which is good for performance, but it never managed the lifecycle of those clients
  • Even though clients were being reused, the underlying connections and threads were still accumulating
  4. No Cleanup in Service Methods:
  • The service methods that used these clients (like processRecordChangedMessages and processSchemaMessages) didn't have any cleanup code
  • Even when operations were complete, the clients remained open with their connections and threads
  5. Multiple Connection Pools: Each new ElasticsearchClient created its own independent RestClient with a separate connection pool and set of worker threads. This happened because:
  • The client was being recreated for different partitions
  • Each client was creating its own transport layer and connection pool
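The remediation implied by points 1–3 can be sketched as a per-partition cache that both bounds each client's pool and closes clients on shutdown. This is an illustrative sketch, not the service's actual code: the class name PartitionClientCache is made up, and in the real service the factory would build the Elasticsearch low-level client with bounded pools, e.g. RestClient.builder(host).setHttpClientConfigCallback(b -> b.setMaxConnTotal(50).setMaxConnPerRoute(10)).build() (the 50/10 limits are example values). The cache itself is shown generically so the lifecycle logic stands alone:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch: one client per partition, created lazily, reused on
// subsequent calls, and closed exactly once on shutdown.
public class PartitionClientCache<C extends Closeable> implements Closeable {

    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    public PartitionClientCache(Function<String, C> factory) {
        this.factory = factory;
    }

    // Reuses the cached client for a partition instead of leaking a new
    // connection pool (and its worker threads) on every call.
    public C getOrCreate(String partitionId) {
        return clients.computeIfAbsent(partitionId, factory);
    }

    // Closes every cached client so its connections and I/O threads are
    // released; remembers the first failure but keeps closing the rest.
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (C client : clients.values()) {
            try {
                client.close();
            } catch (IOException e) {
                if (first == null) first = e;
            }
        }
        clients.clear();
        if (first != null) throw first;
    }
}
```

Registering close() in a shutdown hook (or the framework's @PreDestroy equivalent) addresses point 4: clients are reused while the service runs, and their threads are released when it stops.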

Impact

In the AWS case, the OS thread limit was exhausted before the memory limit was reached. Because the failure surfaces as a thread-creation error rather than an OOM kill, the pod is never restarted: the existing pods survive but can no longer index any records.

Edited by Marc Burnie [AWS]