org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of [org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler]

I noticed that replica shards were sometimes becoming unallocated under high write load. Manually re-allocating them usually worked for a while (a reroute sketch follows the log excerpt below), but before long they would become unallocated again. Looking at the logs, I found errors like these:

[2013-04-28 19:52:08,944][WARN ][action.index             ] [hostname] Failed to perform index on replica [logstash-2013.04.28][5]
org.elasticsearch.transport.RemoteTransportException: [otherhost][inet[/10.0.0.1:9300]][index/replica]
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of [org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler]
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:35)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
        ...

(The stack trace is truncated in the output above; see this gist for the full output.)
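As an aside, the manual re-allocation mentioned above can be scripted against the cluster reroute API instead of done by hand. The sketch below uses the allocate_replica command from newer elasticsearch releases (the command names have changed across versions, and the node name here is made up); the index and shard number match the log entry above:

# assign the unassigned replica of shard 5 to a specific node (node name is illustrative)
curl -XPOST 'localhost:9200/_cluster/reroute' \
     -H 'Content-Type: application/json' \
     -d '{
       "commands": [
         { "allocate_replica": { "index": "logstash-2013.04.28", "shard": 5, "node": "target-node-name" } }
       ]
     }'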

The fix turned out to be relatively simple (set threadpool.index.queue_size to -1), but the explanation of the fix is a bit more involved.
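For concreteness, here is roughly what that setting looks like in elasticsearch.yml (a queue_size of -1 gives the index thread pool an unbounded queue, so index requests wait instead of being rejected; the node needs a restart to pick up the change):

# elasticsearch.yml -- remove the cap on the index thread pool queue
threadpool:
    index:
        queue_size: -1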


org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream

One of the nodes in my elasticsearch cluster was acting kind of flaky. It would lose network connectivity just long enough to drop out of the cluster, then rejoin, then drop out again. This caused a lot of shard reallocation, which didn’t help things at all. After a few drop/rejoins, I decided to spin up a new cluster node and retire the flaky one.

This made the problem much, much worse. Over the next day or two, I experienced all kinds of strange behavior. Replica shards randomly stopped replicating from primaries, and attempts to allocate replicas to new nodes would sometimes fail. At times, even the cluster status command would fail, rendering elasticsearch-head unusable. These messages were very common in the logs:

[2013-04-24 00:59:53,539][WARN ][action.index ] [hostname] Failed to perform index on replica [logstash-2013.04.24][3]
org.elasticsearch.transport.RemoteTransportException: Failed to deserialize exception response from stream
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream
        at org.elasticsearch.transport.netty.MessageChannelHandler.handlerResponseError(MessageChannelHandler.java:171)
        ...
Caused by: java.io.StreamCorruptedException: unexpected end of block data
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1369)
        ...
[2013-04-24 00:59:53,544][WARN ][cluster.action.shard ] [hostname] sending failed shard for [logstash-2013.04.24][3], node[obkgtvEVS3q59PlJWfY03g], [R], s[INITIALIZING], reason
[Failed to perform [index] on replica, message [RemoteTransportException[Failed to deserialize exception response from stream]; nested: TransportSerializationException[Failed to deserialize exception response from stream]; nested: StreamCorruptedException[unexpected end of block data]; ]]
[2013-04-24 00:59:53,544][WARN ][transport.netty ] [hostname] Message not fully read (response) for [4541427] handler org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$4@7a8295ea, error [true], resetting

(The stack traces are truncated in the output above; see this gist for the full output.)

Unfortunately, googling for the exceptions didn’t really help. Many posts either showed different exceptions or went unanswered. The closest thing I could find was this post suggesting not to mix versions of elasticsearch. I knew I was running the same version of elasticsearch on all my nodes (0.20.2), so it appeared to be a dead end.

I finally realized what was causing the problem: when I spun up the new cluster node, it installed a newer JVM, version 1.7.0_21. The rest of the cluster had been spun up weeks earlier and had version 1.7.0_17 of the JVM installed. Even though the version of elasticsearch was the same across nodes, the JVM version was not. Upgrading the rest of the cluster to use the same JVM version fixed the problem.
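To rule this kind of mismatch out quickly, it is worth checking the JVM on every node. Running java -version on each host works, and on newer elasticsearch releases the nodes info API reports it as well (a sketch; the endpoint has moved around between versions):

# on each host
java -version

# or ask the cluster itself (newer releases)
curl 'localhost:9200/_nodes/jvm?pretty'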