I noticed that replica shards were sometimes becoming unallocated under high write load. Manually re-allocating them usually worked for a while, but before long they would become unallocated again. Looking at the logs, I found errors like these:
(The stack trace is truncated in the output above; see this gist for the full output.)
The fix turned out to be relatively simple (set
-1), but the explanation of the fix is a bit more involved.
One of the nodes in my elasticsearch cluster was acting kind of flaky. It would lose network connectivity just long enough to drop out of the cluster, then rejoin, then drop out again. This caused a lot of shard reallocation, which didn’t help things at all. After a few drop/rejoins, I decided to spin up a new cluster node and retire the flaky one.
This made the problem much, much worse. Over the next day or two, I experienced all kinds of strange behavior. Replica shards randomly stopped replicating from primaries, and attempts to allocate replicas to new nodes would sometimes fail. At times, even the cluster status command would fail, rendering elasticsearch-head unusable. These messages were very common in the logs:
(The stack traces are truncated in the output above; see this gist for the full output.)
Unfortunately, googling for the exceptions didn’t really help. Many posts either showed different exceptions, or they were unanswered. The closest thing I could find was this post suggesting not to mix versions of elasticsearch. I knew I was running the same version of elasticsearch on all my nodes (0.20.2), so it appeared to be a dead end.
I finally realized what was causing the problem: when I spun up the new cluster node, it installed a newer JVM, version 1.7.0_21. The rest of the cluster had been spun up weeks earlier and had version 1.7.0_17 of the JVM installed. Even though the version of elasticsearch was the same across nodes, the JVM version was not. Upgrading the rest of the cluster to use the same JVM version fixed the problem.
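A mismatch like this is easy to miss by eye, so here is a minimal sketch of how it could be checked automatically. It assumes you have already fetched and parsed the JSON from the nodes info API with JVM details included (the exact endpoint path differs between elasticsearch versions, so it is left out here) and that each node entry carries a `name` and a `jvm.version` field. The function names and the sample data are hypothetical, made up for illustration.

```python
from collections import defaultdict


def group_nodes_by_jvm(nodes_info):
    """Map each JVM version to the sorted names of the nodes running it.

    `nodes_info` is assumed to be the parsed nodes info response, shaped like
    {"nodes": {node_id: {"name": ..., "jvm": {"version": ...}}}}.
    """
    groups = defaultdict(list)
    for node_id, node in nodes_info.get("nodes", {}).items():
        # Use "unknown" when the jvm section is missing, and fall back to
        # the node id when the node has no name.
        version = node.get("jvm", {}).get("version", "unknown")
        groups[version].append(node.get("name", node_id))
    return {version: sorted(names) for version, names in groups.items()}


def has_jvm_mismatch(nodes_info):
    """Return True when the nodes report more than one JVM version."""
    return len(group_nodes_by_jvm(nodes_info)) > 1


# Hypothetical sample data mirroring the situation described above.
sample = {
    "nodes": {
        "a1": {"name": "node-1", "jvm": {"version": "1.7.0_17"}},
        "b2": {"name": "node-2", "jvm": {"version": "1.7.0_17"}},
        "c3": {"name": "node-3", "jvm": {"version": "1.7.0_21"}},
    }
}

for version, names in sorted(group_nodes_by_jvm(sample).items()):
    print(version, ", ".join(names))
```

Running something like this before adding a node to the cluster would have flagged the odd one out right away.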