I recently spent a few weeks at work focusing on database performance. Specifically, I was looking to reduce load on our database by rewriting application code and queries to be more efficient. Some of the easiest wins were in code that counted things.
Counting is deceptively simple. Even a five year old can count — how hard could it be? The difficulty lies in counting efficiently. In a new application, you can often get away with counting inefficiently because your data set is small and database load is minimal. In a more mature application, there is more data, the database is busier, and there are outliers. Perhaps users on your blogging platform have five posts on average, but a few of your most dedicated users may have thousands of posts. This makes performance problems difficult to spot in development or even production. Ironically, your most dedicated users are the ones affected the most!
After working through a few inefficient queries, here are a few patterns that emerged. The examples below are for a Rails 3 app using ActiveRecord on MySQL, but the same SQL principles should apply to other languages and frameworks.
Use count or size instead of length
If you only need to know how many of something there are, use
length. For example, if you want to know how many posts a user has, use
user.posts.count. This issues a relatively efficient
SELECT COUNT(*) FROM `posts` WHERE (`posts`.user_id = 8)
user.posts.length gives you the same result, but under the hood ActiveRecord issues a
SELECT * query and creates an object for each post, then counts the objects. Not only is this more expensive for the database, it’s more expensive in ruby as well.
SELECT `posts`.* FROM `posts` WHERE (`posts`.user_id = 8)
Use exists? instead of empty? or comparing count to 0
Sometimes you may not even need to know the exact count — you may only care whether there are some or none of something. In these situations, use
exists? instead of
count > 0. To continue the previous example, if you want to know whether or not a user has any posts, use
user.posts.exists?. This issues the following query:
SELECT 1 FROM `posts` WHERE (`posts`.user_id = 8) LIMIT 1
user.posts.empty? essentially does the same thing as
user.posts.count > 0, which issues a
COUNT(*) query. This has similar performance when the collection is small, i.e., the user has only a few posts. However, if the user has many posts, the database will be forced to count them all. With the
SELECT 1 ... LIMIT 1 query, the database can answer the query as soon as the first post is found.
Use offset(x).exists? instead of comparing count to x
What if you need to know if there are a few of something? For example, you want to know if a user has 3 or more posts; if so, they’re no longer considered a new user. You can still use and benefit from
user.posts.offset(3).exists?, which issues the following query:
SELECT 1 FROM `posts` WHERE (`posts`.user_id = 8) LIMIT 1 OFFSET 3
As before, the database can answer the query as soon as it finds 3 posts. Calling
user.posts.count > 3 is functionally equivalent, but if the user has many posts, the database will count them all. The difference can be dramatic — the graph below shows the query time improvement when I deployed a change to convert a
user.posts.count > 3 call to