I noticed that replica shards were sometimes becoming unallocated under high write load. Manually re-allocating them worked for a little while, but before long they would become unallocated again. Looking at the logs, I found errors like these:
(The stack trace is truncated in the output above; see this gist for the full output.)
The fix turned out to be relatively simple (set the `queue_size` to `-1`), but the explanation of the fix is a bit more involved.
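The change itself is one line in elasticsearch.yml, using the 0.90-era flat `threadpool.*` settings; a minimal sketch (the pool size here is an illustrative value, not a recommendation):

```yaml
# elasticsearch.yml (0.90-era flat settings)
threadpool.index.type: fixed       # bounded number of worker threads
threadpool.index.size: 30          # illustrative; tune for your hardware
threadpool.index.queue_size: -1    # unbounded queue: pending writes wait
                                   # instead of being rejected
```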
elasticsearch operations such as index, search, bulk, and merge are all done in threads. Each type of operation has its own dedicated thread pool, which can be configured independently. Prior to elasticsearch 0.90.0.RC2, thread pools defaulted to type `cache`:
The `cache` thread pool is an unbounded thread pool that will spawn a thread if there are pending requests.
Note the word unbounded. From prior experience, I knew this was problematic — if a bunch of requests come in at once, elasticsearch spawns hundreds of threads to service the requests. It spends more time spawning threads than actually doing work, and effectively locks up the machine. In extreme cases the node even drops out of the cluster because it does not respond to pings from other nodes in time.
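As a rough illustration of the cost (plain Python threads standing in for elasticsearch's Java pools; nothing below is elasticsearch code), a cache-style pool creates one thread per pending request, so a burst of requests becomes a burst of thread creation:

```python
import threading

# Cache-style behavior: spawn one new thread per pending request.
def handle(request, results):
    results.append(request * 2)  # stand-in for real work

results = []
requests = range(100)
threads = [threading.Thread(target=handle, args=(r, results)) for r in requests]
for t in threads:
    t.start()            # 100 requests -> 100 freshly spawned threads
for t in threads:
    t.join()

# With thousands of concurrent requests, thread creation and scheduling
# overhead can dominate the actual work.
print(len(threads), len(results))  # -> 100 100
```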
To prevent this behavior, when I first configured the cluster I followed the advice of this blog post (and of the instructor at the elasticsearch training I attended) and set the `type` of each thread pool to `fixed`. From the documentation:
The `fixed` thread pool holds a fixed size of threads to handle the requests, with a queue (optionally bounded) for pending requests that have no threads to service them.
I also set limits on the `queue_size` of each thread pool. All was working well until write load started increasing and I started running into the exceptions described above.
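Concretely, my elasticsearch.yml at that point looked roughly like this (the numbers are illustrative, not my actual values):

```yaml
# fixed pools with bounded queues -- the configuration that later
# caused rejections (all values illustrative)
threadpool.index.type: fixed
threadpool.index.queue_size: 1000   # requests beyond this are rejected
threadpool.bulk.type: fixed
threadpool.bulk.queue_size: 500
```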
Although I was correct to set the thread pool type to `fixed`, I should have paid more attention to the `queue_size` parameter. Again, from the documentation:
The `queue_size` allows to control the size of the queue of pending requests that have no threads to execute them. By default, it is set to `-1`, which means it is unbounded. When a request comes in and the queue is full, the `reject_policy` parameter can control how it will behave. The default, `abort`, will simply fail the request. Setting it to `caller` will cause the request to execute on an IO thread, allowing to throttle the execution on the networking layer.
With this behavior in mind, the `EsRejectedExecutionException` made perfect sense. Under high write load, not only was the thread pool exhausted, the queue was too. When this happens, the `reject_policy` kicks in and rejects the write to the replica shard. elasticsearch then marks the replica as failed, and it becomes unallocated.
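The mechanism can be sketched in a few lines of Python (purely illustrative; elasticsearch's pools are Java thread pools, not this code): with a fixed pool of workers and a bounded queue, once every worker is busy and the queue is full, new submissions fail outright, which is the shape of the `abort` behavior behind `EsRejectedExecutionException`:

```python
import queue
import threading

POOL_SIZE, QUEUE_SIZE = 2, 3
tasks = queue.Queue(maxsize=QUEUE_SIZE)
busy = threading.Semaphore(0)   # workers signal when they've taken a task
gate = threading.Event()        # keeps workers "busy" so the queue backs up

def worker():
    while True:
        item = tasks.get()
        if item is None:        # poison pill: shut the worker down
            return
        busy.release()          # tell the main thread this worker is occupied
        gate.wait()             # simulate a slow indexing operation

workers = [threading.Thread(target=worker) for _ in range(POOL_SIZE)]
for w in workers:
    w.start()

# Occupy every worker thread...
for i in range(POOL_SIZE):
    tasks.put(i)
for _ in range(POOL_SIZE):
    busy.acquire()

# ...then fill the bounded queue...
for i in range(QUEUE_SIZE):
    tasks.put_nowait(i)

# ...so further submissions are rejected, like a write to a replica shard
# hitting a full queue under the "abort" reject_policy.
rejected = 0
for i in range(5):
    try:
        tasks.put_nowait(i)
    except queue.Full:
        rejected += 1

gate.set()                      # let workers finish and drain the queue
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()
print(rejected)                 # -> 5
```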
Capping the size of the thread pool keeps the machine from locking up, but capping the size of the overflow queue causes writes to replica shards to be rejected if the queue grows too long. The solution is to keep the thread pool at a fixed size but set the queue size to unbounded, i.e., `threadpool.index.queue_size: -1`. Indeed, according to the documentation, newer versions of elasticsearch set the thread pool `size` to 5 times the number of cores on the machine and the