At work, we have a number of REST services that are responsible for most of the content you see on IGN. For example, all our reviews and features are stored in an articles service. Data about games (release date, publisher, our review score, etc.) have their own service, and so on. These services use Mongo as their primary data store.
Sometimes data is retrieved by slug or ID. These fields are indexed in Mongo, so we just do a straightforward Mongo query. Mongo uses the index and the query is performant.
Other times, we need to do more complex queries. For example, to build this page, we need to do a query like:
- Get me games
- That were released in the US
- That have a review score greater than 0
- That are for the PS3
- That are RPGs
- And sort the list by review publish date
Even if we had indexes on all of these fields, Mongo can’t combine indexes dynamically. It will pick the “best” one, then try to resolve the other criteria the hard way. To speed up the query, we would need an index that covered all of the fields specified in the query. Unfortunately, that is impractical because of the diversity of our queries.
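To make the shape of that request concrete, here is a rough sketch of how it might look as a Mongo query. The field names (`release_region`, `review_score`, `platform`, `genre`, `review_publish_date`) are invented for this example and are not the real schema, and the descending sort direction is an assumption.

```python
# Hypothetical field names; the real games schema may differ.
game_filter = {
    "release_region": "US",        # released in the US
    "review_score": {"$gt": 0},    # review score greater than 0
    "platform": "PS3",             # for the PS3
    "genre": "RPG",                # RPGs only
}

# Sort by review publish date; -1 means descending (an assumption).
game_sort = [("review_publish_date", -1)]

# With pymongo this would be run roughly as:
#   db.games.find(game_filter).sort(game_sort)
```

Every key in `game_filter`, plus the sort key, would have to appear in one compound index for Mongo to answer this from an index alone.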
Our solution is to use elasticsearch. Whenever we do a write to Mongo, we also index the document in elasticsearch. Whenever we get a query that involves unindexed fields, we send the query to elasticsearch. elasticsearch is able to leverage multiple indexes per query, so responses are relatively fast even if the query is complex. As a bonus, filters in elasticsearch are cached, so future queries that involve the same criteria are even faster.
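A minimal sketch of that write path, assuming the two stores are reached through caller-supplied functions. `save_game`, `write_to_mongo`, and `index_in_search` are hypothetical names for illustration, not part of any client library.

```python
from typing import Any, Callable, Dict

Document = Dict[str, Any]


def save_game(
    doc: Document,
    write_to_mongo: Callable[[Document], None],
    index_in_search: Callable[[str, Document], None],
) -> None:
    """Write the document to Mongo, then mirror it into the search index."""
    # Mongo is written first because it is the primary data store.
    write_to_mongo(doc)
    # Reuse the Mongo _id as the search document ID so that a search hit
    # maps straight back to a Mongo primary key.
    body = {key: value for key, value in doc.items() if key != "_id"}
    index_in_search(str(doc["_id"]), body)
```

If the second call fails, the search index lags behind Mongo until the document is indexed again. Handling that case is left out of this sketch.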
elasticsearch supports storing entire documents, so theoretically we could just return the document from elasticsearch. We wanted to keep Mongo as the authoritative database, so we just have elasticsearch return document IDs, which we then look up in Mongo. These are primary key lookups, so they are indexed and fast.
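The ID round trip can be sketched as below. One detail worth noting is that a Mongo `$in` lookup does not return documents in the order of the ID list, so the sorted order from elasticsearch has to be reapplied afterwards. The helper names are hypothetical, and the sketch assumes the search document IDs are the string form of the Mongo `_id`.

```python
from typing import Any, Dict, Iterable, List

Document = Dict[str, Any]


def hit_ids(search_response: Dict[str, Any]) -> List[str]:
    """Pull the document IDs out of a search response, in ranked order."""
    return [hit["_id"] for hit in search_response["hits"]["hits"]]


def id_lookup_filter(ids: List[str]) -> Dict[str, Any]:
    """Build the Mongo filter for a primary key lookup of all IDs at once."""
    return {"_id": {"$in": ids}}


def in_search_order(ids: List[str], docs: Iterable[Document]) -> List[Document]:
    """Reorder Mongo results to match the order of the search hits."""
    by_id = {str(doc["_id"]): doc for doc in docs}
    # IDs missing from Mongo (for example, a stale index entry) are skipped.
    return [by_id[doc_id] for doc_id in ids if doc_id in by_id]
```

If the Mongo `_id` values are ObjectIds rather than strings, the IDs would need converting before they go into the `$in` filter.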
If you use elasticsearch in this way, keep these tips in mind:
Use filters instead of queries — From the elasticsearch Query DSL documentation:
Filters are very handy since they perform an order of magnitude better than plain queries since no scoring is performed and they are automatically cached.
Filters can be a great candidate for caching. Caching the result of a filter does not require a lot of memory, and will cause other queries executing against the same filter (same parameters) to be blazingly fast.
We don’t care about scoring, so we use a match_all query and encode the rest of the search criteria as filters.
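As an illustration, a request body of that shape could be built like this. It uses the `filtered` query, which is the form elasticsearch releases of this era provided; later versions replaced it with a `bool` query that has a `filter` clause. The helper name and the field names are hypothetical, and the `term` filters assume the fields are indexed without analysis so that values match exactly.

```python
from typing import Any, Dict, List

Filter = Dict[str, Any]


def filtered_search(filters: List[Filter], sort_field: str, size: int = 20) -> Dict[str, Any]:
    """Build a search body: match_all for the query, all criteria as filters."""
    return {
        "query": {
            "filtered": {
                # No scoring needed, so the query part matches everything.
                "query": {"match_all": {}},
                # Every criterion is a filter, combined with a logical AND.
                "filter": {"and": filters},
            }
        },
        "sort": [{sort_field: {"order": "desc"}}],
        "size": size,
    }


# Hypothetical field names for the games example.
games_request = filtered_search(
    filters=[
        {"term": {"release_region": "US"}},
        {"range": {"review_score": {"gt": 0}}},
        {"term": {"platform": "PS3"}},
        {"term": {"genre": "RPG"}},
    ],
    sort_field="review_publish_date",
)
```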
_source field ‐ To enable highlighting of results, elasticsearch stores a pristine copy of each indexed document by default. If you don’t plan on doing highlighting or retrieving entire documents from elasticsearch, disable the _source field to save space in the index.
In one of our smaller indexes, disabling the _source field reduced the index size by 58% (953.5 MB to 399.9 MB).
_all field ‐ To make searching more convenient, elasticsearch combines all searchable fields into a hidden _all field by default. If you don’t specify a field to search, queries will be made against the _all field. If you’re always going to specify the field to filter against, disable the _all field to save space in the index.
In the same index I mentioned earlier, disabling the _all field reduced the index size by another 31% (399.9 MB to 276.5 MB), for a total reduction of about 71% from the original 953.5 MB.
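For reference, a mapping that turns off both fields might look roughly like the sketch below. The layout follows the per-type mapping format of older elasticsearch releases, and later versions dropped the `_all` field entirely, so check the documentation for the version you run. The type name, field names, and helper are hypothetical.

```python
from typing import Any, Dict


def lean_mapping(type_name: str, properties: Dict[str, Any]) -> Dict[str, Any]:
    """Build a type mapping with the _source and _all fields disabled."""
    return {
        type_name: {
            # No stored copy of the original document.
            "_source": {"enabled": False},
            # No catch-all field; every filter must name its field.
            "_all": {"enabled": False},
            "properties": properties,
        }
    }


# Hypothetical type and fields for the games example.
games_mapping = lean_mapping(
    "game",
    {
        "platform": {"type": "string", "index": "not_analyzed"},
        "review_score": {"type": "float"},
    },
)
```

With `_source` disabled, elasticsearch can still filter, sort, and return document IDs, which is all the lookup-in-Mongo approach needs.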