OutOfMemoryError: Max Thread Limit Reached

Problem Description

The service was experiencing Out of Memory (OOM) errors due to thread leaks in the Elasticsearch client. The error logs showed:

[2040.742s][warning][os,thread] Failed to start thread "Unknown thread" - pthread_create failed (EAGAIN) for attributes: stacksize: 2040k, guardsize: 0k, detached.
[2040.742s][warning][os,thread] Failed to start the native thread for java.lang.Thread "elasticsearch-rest-client-3659-thread-4"
Exception in thread "elasticsearch-rest-client-3659-thread-1" java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached

The ps -lfT | wc -l command showed 18,723 threads, which is extremely high. Each thread reserves memory for its stack (typically 1 MB or more).
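A back-of-the-envelope sketch of why that thread count matters. The 1 MB-per-thread figure below is an assumption (a common JVM default; the log above actually shows a ~2 MB stacksize), so treat the result as an order-of-magnitude estimate:

```java
public class ThreadStackEstimate {

    // Assumed per-thread stack reservation; the real value is platform- and
    // -Xss-dependent (the log above reports a 2040k stacksize).
    static final long STACK_BYTES_PER_THREAD = 1L * 1024 * 1024;

    static long estimateStackBytes(int threadCount) {
        return threadCount * STACK_BYTES_PER_THREAD;
    }

    public static void main(String[] args) {
        long bytes = estimateStackBytes(18_723);
        // prints "~18.3 GiB reserved for thread stacks"
        System.out.printf("~%.1f GiB reserved for thread stacks%n",
                bytes / (1024.0 * 1024 * 1024));
    }
}
```

Even at the conservative 1 MB assumption, 18,723 threads reserve roughly 18 GiB of stack address space, so thread or memory limits are hit long before the heap fills.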

Root Cause Analysis

  1. Unbounded Client Creation Without Cleanup:
  • The getOrCreateRestClient() method creates a new Elasticsearch client if one doesn't exist in the cache for the current partition ID
  • However, there was no corresponding method to close these clients when they were no longer needed
  • Each client maintains its own connection pool and thread pool, which were never being cleaned up
  2. No Connection Pool Limits:
  • The HttpAsyncClientBuilder didn't have any limits on the number of connections it could create
  • Without setMaxConnTotal and setMaxConnPerRoute, the client would create an unlimited number of connections
  • Each connection creates threads, leading to thread exhaustion
  3. Client Reuse Without Resource Management:
  • The code cached clients by partition ID, which is good for performance, but it never managed the lifecycle of those clients
  • Even though clients were being reused, the underlying connections and threads were still accumulating
  4. No Cleanup in Service Methods:
  • The service methods that used these clients (like processRecordChangedMessages and processSchemaMessages) didn't have any cleanup code
  • Even when operations were complete, the clients remained open with their connections and threads
  5. Multiple Connection Pools: Each new ElasticsearchClient created its own independent RestClient with a separate connection pool and set of worker threads. This happened because:
  • The client was being recreated for different partitions
  • Each client was creating its own transport layer and connection pool
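The remediation implied by points 1–3 can be sketched as a per-partition cache that both bounds each client's pool and closes clients on shutdown. This is an illustrative sketch, not the service's actual code: the class name PartitionClientCache is made up, and in the real service the factory would build the Elasticsearch low-level client with bounded pools, e.g. RestClient.builder(host).setHttpClientConfigCallback(b -> b.setMaxConnTotal(50).setMaxConnPerRoute(10)).build() (the 50/10 limits are example values). The cache itself is shown generically so the lifecycle logic stands alone:

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative sketch: one client per partition, created lazily, reused on
// subsequent calls, and closed exactly once on shutdown.
public class PartitionClientCache<C extends Closeable> implements Closeable {

    private final Map<String, C> clients = new ConcurrentHashMap<>();
    private final Function<String, C> factory;

    public PartitionClientCache(Function<String, C> factory) {
        this.factory = factory;
    }

    // Reuses the cached client for a partition instead of leaking a new
    // connection pool (and its worker threads) on every call.
    public C getOrCreate(String partitionId) {
        return clients.computeIfAbsent(partitionId, factory);
    }

    // Closes every cached client so its connections and I/O threads are
    // released; remembers the first failure but keeps closing the rest.
    @Override
    public void close() throws IOException {
        IOException first = null;
        for (C client : clients.values()) {
            try {
                client.close();
            } catch (IOException e) {
                if (first == null) first = e;
            }
        }
        clients.clear();
        if (first != null) throw first;
    }
}
```

Registering close() in a shutdown hook (or the framework's @PreDestroy equivalent) addresses point 4: clients are reused while the service runs, and their threads are released when it stops.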

Impact

In the AWS case, the OS thread limit was exhausted before the memory limit was reached. Because the failure surfaces as a thread-creation error rather than an OOM kill, the pod is never restarted: the existing pods survive but can no longer index any records.

Edited by Marc Burnie [AWS]