A commonly-requested feature in search applications is autocomplete or search suggestions. The basic idea is to give users instant feedback as they’re typing. Implementations of this feature can vary — sometimes the suggestions are based on popular queries (e.g., Google’s Autocomplete), other times they may be previews of results (e.g., Google Instant). The suggestions may be relatively simple, or they can be extremely complex, taking into account things like the user’s search history, generally popular queries, top results, trending topics, spelling mistakes, etc. Building the latter can consume the resources of a company the size of Google, but it’s relatively easy to add simple results-based autocomplete to an existing elasticsearch search application.
If you have an existing search application, its index probably contains normalized tokens that more or less correspond with words. (See my blog post on indexing and analysis for more details about the tokenization process.) For example, the index may have the tokens:
[quick] [brown] [fox] [jump] [over] [lazy] [dog]
When a user performs a regular search for “quick”, the query matches the token in the index and the document is returned. However, an autocomplete “query” will not contain complete words. As the user types “quick”, your search application might receive queries for “q”, “qu”, “qui”, “quic”, and “quick”. (Exactly which queries are received depends on factors such as minimum character limits, event timeouts, how fast the user types, and if any typos are made.)
Prefix Query
To get a query like “qu” to match the document, you could use a Prefix query. Unfortunately, this approach is not very scalable because the Prefix query works by looking for all terms in the index that match the given prefix, then constructing a boolean query that contains each matching term. So if your index has terms like “quack”, “quote”, and “quarter”, the actual search performed for the query “qu” will be something like:
quick OR quack OR quote OR quarter
If there are a lot of terms in your index, this query can become quite unwieldy and eventually result in a “too many clauses” error.
Also, the Prefix query does not allow matching in the middle of words — for example, “ball” would not match “baseball”. You may be able to do this with a Wildcard query, but if the Prefix query won’t scale, the Wildcard query won’t either.
N-Grams
A better approach is to adjust the indexing process so the partial word queries match directly. So instead of just:
[quick]
the index should contain (at a minimum):
[q] [qu] [qui] [quic] [quick]
As it turns out, there’s a name for subsets of letters that make up words — they’re called n-grams. More importantly, there are several tokenizers and token filters that come with elasticsearch that will break words up into n-grams.
There are a few decisions you will have to make when integrating n-grams into your search application.
Tokenizer vs. Token Filter — The documentation for the NGram tokenizer simply says “A tokenizer of type nGram“, but it does what you’d expect — it takes a string and breaks it up into n-grams. It does not appear to do any word-based tokenization, so with the default settings (min_gram=1 and max_gram=2), the string “wx yz” is tokenized into:
[w] [x] [ ] [y] [z] [wx] [x ] [ y] [yz]
While there is probably a use case for this, I found the NGram token filter to be much more useful than the tokenizer for normal text. Since it’s just another token filter, it’s possible to still use the Standard tokenizer to split the text into words, then lowercase, remove stop words, and perform stemming on tokens before converting the resulting tokens into n-grams. Here’s a sample configuration of such an autocomplete analyzer:
{
"settings":{
"analysis":{
"analyzer":{
"autocomplete":{
"type":"custom",
"tokenizer":"standard",
"filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
}
}
}
}
}
Depending on the typical contents of the field(s) you’re autocompleting on, you may or may not want to remove stop words or perform stemming. For example, if you are autocompleting document titles, users may have the exact title in mind. In this case, removing stop words and/or performing stemming might make the results noisier.
NGram vs. Edge NGram — The NGram token filter generates all n-grams of the configured sizes for each token. For example, with the default settings (min_gram=1 and max_gram=2), “brown” is tokenized into:
[b] [r] [o] [w] [n] [br] [ro] [ow] [wn]
The Edge NGram token filter only generates n-grams from the beginning of the word:
[b] [br]
If you use edge n-grams, you will probably want to increase max_gram so you generate a few more terms. Setting it to 5 would yield:
[b] [br] [bro] [brow] [brown]
N-grams allow matching in the middle of words at the cost of more tokens and a larger index. If you’re not interested in matching in the middle of words and you want to save space, edge n-grams might be a better choice. For words like “brown”, it is unlikely that a user will search for “rown” (unless he makes a typo), and it is perhaps even undesirable that a search for “own” will match “brown”, so matching in the middle of words may not be important. On the other hand, your users may expect “ball” to match a compound word like “baseball”, in which case matching in the middle of words may be desirable.
The Edge NGram token filter can be configured to generate n-grams that include the end of the word instead of the beginning, although I’m not sure what the use case for this would be. Also, if you prefer tokenizers over token filters, there is an Edge variant of the NGram tokenizer called (you guessed it!) the Edge NGram tokenizer.
Which Fields to Index — Obviously, indexing n-grams adds a lot of terms to the index. If you can get away with only supporting autocomplete on name or title fields that are relatively short, you can save a lot of space by only indexing these fields as n-grams. On the other hand, if you need to be able to support partial word matches on full documents, you may have to index all fields as n-grams and greatly increase the size of your index, or use prefix or wildcard queries at the expense of performance.
elasticsearch has a Multi Field type that supports indexing a field multiple ways. This handy feature can be used to index a field “normally” and also as n-grams. For example, this partial document mapping shows how to index a name field “normally” and as n-grams using the autocomplete analyzer we defined earlier:
{
"articles":{
"properties":{
"name":{
"type":"multi_field",
"fields":{
"name":{
"type":"string"
},
"autocomplete":{
"analyzer":"autocomplete",
"type":"string"
}
}
}
}
}
}
Note that the inner mutli field that shares the name of the multi field itself is the “default” version of the field. When searching, you can access the default version by just using the name of the multi field itself, e.g., name. To search on the autocomplete version, use name.autocomplete.
Size of N-grams and Query Strategy — As mentioned previously, the default settings are min_gram=1 and max_gram=2. At first, I tried analyzing both the indexed terms and the query terms with these settings. The theory was that many documents in the index might match “f” or “fo”, but only documents that contained “fox” would match all of the unigrams and bigrams: “f”, “o”, “x”, “fo”, and “ox”. As the user kept typing, the best matches would float to the top because Lucene’s scoring algorithm prefers documents that match more terms. (I quickly realized that single letter terms were not very helpful because they matched so many documents, so I increased min_gram to 2.)
Unfortunately, I found the results to be rather unintuitive. I didn’t spend a lot of time tracking down exactly why, but one of the issues appeared to be because a match for “fo” in one word and a match for “ox” in another would count the same, so a document with the words “for” and “box” was just as likely to appear at the top as a document with the word “fox”. Also, ordering of the terms didn’t matter, so a hypothetical document that contained the string “xof” would also match just as well.
To fix these issues, I tried having the Match query generate a phrase query instead of a boolean query. This fixed the issue of terms appearing far apart and in order, but I lost the ability to omit words. For example, if the user typed “quick fox”, “quick brown fox” would no longer match. I tried to overcome this new problem by allowing some slop in the phrase query. However, slop is measured in terms which are each 2 characters long (because of the max_gram setting), so I had to use a very large slop to make omitting words possible, which defeated the purpose of using a phrase query in the first place.
Stumped, I reached out to a fellow blogger I met at an elasticsearch training who had already built an autocomplete feature. (Side note: the elasticsearch training was packed with information — well worth it, in my opinion.) He suggested that I index larger n-grams (min_gram=2 and max_gram=15) but skip converting the query into n-grams. So instead of relying on multiple bigram matches to add up to a good score, like this:
Query: j Analyzed Query Terms: [j] Document Terms: [ju] [um] [mp] Query: ju Analyzed Query Terms: <ju> Document Terms: <ju> [um] [mp] Query: jum Analyzed Query Terms: <ju> <um> Document Terms: <ju> <um> [mp] Query: jump Analyzed Query Terms: <ju> <um> <mp> Document Terms: <ju> <um> <mp>
queries are just matched directly:
Query: j Analyzed Query Terms: [j] Document Terms: [ju] [um] [mp] [jum] [ump] [jump] Query: ju Analyzed Query Terms: <ju> Document Terms: <ju> [um] [mp] [jum] [ump] [jump] Query: jum Analyzed Query Terms: <jum> Document Terms: [ju] [um] [mp] <jum> [ump] [jump] Query: jump Analyzed Query Terms: <jump> Document Terms: [ju] [um] [mp] [jum] [ump] <jump>
With this scheme, a phrase query isn’t required to keep word fragments in order, although if the autocomplete query contains multiple words, a phrase query may still be desirable to keep words in order. (See this stackoverflow post.)
In retrospect, I could have probably fixed the accuracy issue by increasing min_gram to cut down on false positives, but by the time I realized this I already had things working. My original approach might also allow using a smaller value of max_gram since long words do not have to fit in a single term. A smaller value of max_gram would create less tokens, which would make the index smaller. Depending on the size of the n-grams, it may also be able to tolerate minor typos inside words, which my current solution does not. Something to experiment with in the future, I suppose.