Last month I gave a talk at Velocity Conf NY with Ben Hughes titled Scaling Things That Don’t Scale: Scalability and Reliability at Airbnb. The slides and a video of the talk are both available online.
I recently spent a few weeks at work focusing on database performance. Specifically, I was looking to reduce load on our database by rewriting application code and queries to be more efficient. Some of the easiest wins were in code that counted things.
Counting is deceptively simple. Even a five year old can count — how hard could it be? The difficulty lies in counting efficiently. In a new application, you can often get away with counting inefficiently because your data set is small and database load is minimal. In a more mature application, there is more data, the database is busier, and there are outliers. Perhaps users on your blogging platform have five posts on average, but a few of your most dedicated users may have thousands of posts. This makes performance problems difficult to spot in development or even production. Ironically, your most dedicated users are the ones affected the most!
After working through a few inefficient queries, here are a few patterns that emerged. The examples below are for a Rails 3 app using ActiveRecord on MySQL, but the same SQL principles should apply to other languages and frameworks.
Today I open sourced hammerspace, a project that I’ve been working on for the last few months. The accompanying blog post explains the context and motivation. For me, there are three important takeaways.
First, make sure you know what your application is doing. We knew that application startup was slow, and that loading translations over the network was probably expensive, and that all of those translations probably took up a decent chunk of memory. But we never actually measured these things, so we never understood how slow or expensive or large these things were. And we didn’t anticipate how all of these things conspired together to cause such severe performance degradation.
Second, not all “memory” is equal. A lot of the comments on Hacker News suggested shared memory, or mmap’d files, or memcache or redis. Prior to hammerspace we were using the heap, and after hammerspace we are using the filesystem cache. These are all effectively “in memory” solutions, but they vary wildly in ease of implementation and speed of access from within ruby. Don’t just assume that because you’re keeping data “in memory” that it will automatically be fast.
Third, your time is not always best spent writing code. My co-worker and I joke that we don’t write more than 10 lines of code a week, and there’s some truth to it — at time of release, hammerspace is only 736 lines of production code, or 528 lines if you exclude whitespace and comments. And yet the impact was huge — 66% reduction in in-request garbage collection time, and a 17% reduction in application server response time overall. The real time and effort was spent hunting down the cause of the problem, evaluating potential solutions scientifically, deploying increasingly invasive experiments to production, collecting and analyzing data, iteratively refining the design, generalizing the solution, writing an extensive test suite, documenting everything, evangelizing the solution internally, and preparing the project for public release.
This has been the nature of my work recently. Every so often I still have to just crank out a bunch of code, but increasingly I’ve been spending more time on the things around the code. It’s gratifying to have something you can maintain, be proud of, and deploy with confidence. I’m very much looking forward to my next project.
I noticed that replica shards were sometimes becoming unallocated under high write load. Manually re-allocating them usually worked for a little bit, but after a little while they would become unallocated again. Looking at the logs, I found errors like these:
(The stack trace is truncated in the output above; see this gist for the full output.)
The fix turned out to be relatively simple (set
-1), but the explanation of the fix is a bit more involved.
One of the nodes in my elasticsearch cluster was acting kind of flaky. It would lose network connectivity just long enough to drop out of the cluster, then rejoin, then drop out again. This caused a lot of shard reallocation, which didn’t help things at all. After a few drop/rejoins, I decided to spin up a new cluster node and retire the flaky one.
This made the problem much, much worse. Over the next day or two, I experienced all kinds of strange behavior. Replica shards randomly stopped replicating from primaries, and attempts to allocate replicas to new nodes would sometimes fail. At times, even the cluster status command would fail, rendering elasticsearch-head unusable. These messages were very common in the logs:
(The stack traces are truncated in the output above; see this gist for the full output.)
Unfortunately, googling for the exceptions didn’t really help. Many posts either showed different exceptions, or they were unanswered. The closest thing I could find was this post suggesting not to mix versions of elasticsearch. I knew I was running the same version of elasticsearch on all my nodes (0.20.2), so it appeared to be a dead end.
I finally realized what was causing the problem — when I spun up the new cluster node, it installed a newer JVM, version 1.7.0_21. The rest of the cluster was spun up weeks ago and had version 1.7.0_17 of the JVM installed. Even though the version of elasticsearch was the same across nodes, the the JVM version was not. Upgrading the rest of the cluster to use the same JVM version fixed the problem.