Adding Autocomplete to an elasticsearch Search Application

A commonly requested feature in search applications is autocomplete or search suggestions. The basic idea is to give users instant feedback as they’re typing. Implementations of this feature can vary — sometimes the suggestions are based on popular queries (e.g., Google’s Autocomplete); other times they may be previews of results (e.g., Google Instant). The suggestions may be relatively simple, or they can be extremely complex, taking into account things like the user’s search history, generally popular queries, top results, trending topics, spelling mistakes, etc. Building the latter can consume the resources of a company the size of Google, but it’s relatively easy to add simple results-based autocomplete to an existing elasticsearch search application.

If you have an existing search application, its index probably contains normalized tokens that more or less correspond with words. (See my blog post on indexing and analysis for more details about the tokenization process.) For example, the index may have the tokens:

[quick] [brown] [fox] [jump] [over] [lazy] [dog]

When a user performs a regular search for “quick”, the query matches the token in the index and the document is returned. However, an autocomplete “query” will not contain complete words. As the user types “quick”, your search application might receive queries for “q”, “qu”, “qui”, “quic”, and “quick”. (Exactly which queries are received depends on factors such as minimum character limits, event timeouts, how fast the user types, and if any typos are made.)

Prefix Query

To get a query like “qu” to match the document, you could use a Prefix query. Unfortunately, this approach is not very scalable because the Prefix query works by looking for all terms in the index that match the given prefix, then constructing a boolean query that contains each matching term. So if your index has terms like “quack”, “quote”, and “quarter”, the actual search performed for the query “qu” will be something like:

quick OR quack OR quote OR quarter

If there are a lot of terms in your index, this query can become quite unwieldy and eventually result in a “too many clauses” error.
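For reference, a basic prefix query looks something like this (the name field here is just an example):

{
  "query":{
    "prefix":{
      "name":"qu"
    }
  }
}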

Also, the Prefix query does not allow matching in the middle of words — for example, “ball” would not match “baseball”. You may be able to do this with a Wildcard query, but if the Prefix query won’t scale, the Wildcard query won’t either.

N-Grams

A better approach is to adjust the indexing process so that partial-word queries match directly. Instead of just:

[quick]

the index should contain (at a minimum):

[q] [qu] [qui] [quic] [quick]

As it turns out, there’s a name for these contiguous sequences of letters taken from words — they’re called n-grams. More importantly, several tokenizers and token filters that come with elasticsearch will break words up into n-grams.

There are a few decisions you will have to make when integrating n-grams into your search application.

Tokenizer vs. Token Filter — The documentation for the NGram tokenizer simply says “A tokenizer of type nGram”, but it does what you’d expect — it takes a string and breaks it up into n-grams. It does not appear to do any word-based tokenization, so with the default settings (min_gram=1 and max_gram=2), the string “wx yz” is tokenized into:

[w] [x] [ ] [y] [z] [wx] [x ] [ y] [yz]

While there is probably a use case for this, I found the NGram token filter to be much more useful than the tokenizer for normal text. Since it’s just another token filter, it’s still possible to use the Standard tokenizer to split the text into words, then lowercase, remove stop words, and stem the tokens before converting the resulting tokens into n-grams. Here’s a sample configuration of such an autocomplete analyzer:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "autocomplete":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ] 
        }
      }
    }
  }
}
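If you want to see what this analyzer produces for a given string, the _analyze API is handy. A minimal sketch, assuming the settings above have been applied to an index named articles:

curl -XGET 'localhost:9200/articles/_analyze?analyzer=autocomplete' -d 'The Quick Brown Fox'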

Depending on the typical contents of the field(s) you’re autocompleting on, you may or may not want to remove stop words or perform stemming. For example, if you are autocompleting document titles, users may have the exact title in mind. In this case, removing stop words and/or performing stemming might make the results noisier.

NGram vs. Edge NGram — The NGram token filter generates all n-grams of the configured sizes for each token. For example, with the default settings (min_gram=1 and max_gram=2), “brown” is tokenized into:

[b] [r] [o] [w] [n] [br] [ro] [ow] [wn]

The Edge NGram token filter only generates n-grams from the beginning of the word:

[b] [br]

If you use edge n-grams, you will probably want to increase max_gram so you generate a few more terms. Setting it to 5 would yield:

[b] [br] [bro] [brow] [brown]

N-grams allow matching in the middle of words at the cost of more tokens and a larger index. If you’re not interested in matching in the middle of words and you want to save space, edge n-grams might be a better choice. For words like “brown”, it is unlikely that a user will search for “rown” (unless he makes a typo), and it is perhaps even undesirable that a search for “own” will match “brown”, so matching in the middle of words may not be important. On the other hand, your users may expect “ball” to match a compound word like “baseball”, in which case matching in the middle of words may be desirable.

The Edge NGram token filter can be configured to generate n-grams that include the end of the word instead of the beginning, although I’m not sure what the use case for this would be. Also, if you prefer tokenizers over token filters, there is an Edge variant of the NGram tokenizer called (you guessed it!) the Edge NGram tokenizer.
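For reference, here’s a minimal sketch of how an Edge NGram filter with a larger max_gram might be defined. The filter name is arbitrary, and depending on your elasticsearch version the type may be spelled edgeNGram or edge_ngram:

{
  "settings":{
    "analysis":{
      "filter":{
        "autocomplete_edge_ngram":{
          "type":"edgeNGram",
          "min_gram":1,
          "max_gram":5
        }
      }
    }
  }
}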

Which Fields to Index — Obviously, indexing n-grams adds a lot of terms to the index. If you can get away with only supporting autocomplete on name or title fields that are relatively short, you can save a lot of space by only indexing these fields as n-grams. On the other hand, if you need to be able to support partial word matches on full documents, you may have to index all fields as n-grams and greatly increase the size of your index, or use prefix or wildcard queries at the expense of performance.

elasticsearch has a Multi Field type that supports indexing a field multiple ways. This handy feature can be used to index a field “normally” and also as n-grams. For example, this partial document mapping shows how to index a name field “normally” and as n-grams using the autocomplete analyzer we defined earlier:

{
  "articles":{
    "properties":{
      "name":{
        "type":"multi_field",
        "fields":{
          "name":{
            "type":"string"
          },
          "autocomplete":{
            "analyzer":"autocomplete",
            "type":"string"
          }
        }
      }
    }
  }
}

Note that the inner multi field that shares the name of the multi field itself is the “default” version of the field. When searching, you can access the default version by just using the name of the multi field itself, e.g., name. To search on the autocomplete version, use name.autocomplete.
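For example, an autocomplete query against the n-grammed version of the field might be a simple match query (the query string here is just an example):

{
  "query":{
    "match":{
      "name.autocomplete":"qui"
    }
  }
}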

Size of N-grams and Query Strategy — As mentioned previously, the default settings are min_gram=1 and max_gram=2. At first, I tried analyzing both the indexed terms and the query terms with these settings. The theory was that many documents in the index might match “f” or “fo”, but only documents that contained “fox” would match all of the unigrams and bigrams: “f”, “o”, “x”, “fo”, and “ox”. As the user kept typing, the best matches would float to the top because Lucene’s scoring algorithm prefers documents that match more terms. (I quickly realized that single letter terms were not very helpful because they matched so many documents, so I increased min_gram to 2.)

Unfortunately, I found the results to be rather unintuitive. I didn’t spend a lot of time tracking down exactly why, but one of the issues appeared to be because a match for “fo” in one word and a match for “ox” in another would count the same, so a document with the words “for” and “box” was just as likely to appear at the top as a document with the word “fox”. Also, ordering of the terms didn’t matter, so a hypothetical document that contained the string “oxfo” would also match just as well.

To fix these issues, I tried having the Match query generate a phrase query instead of a boolean query. This fixed the issue of terms matching far apart or out of order, but I lost the ability to omit words. For example, if the user typed “quick fox”, “quick brown fox” would no longer match. I tried to overcome this new problem by allowing some slop in the phrase query. However, slop is measured in terms, which were each only 2 characters long (because of the max_gram setting), so I had to use a very large slop to make omitting words possible, which defeated the purpose of using a phrase query in the first place.
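As a sketch, the phrase variant of the match query looked something like this (the field name and slop value are only illustrative):

{
  "query":{
    "match":{
      "name.autocomplete":{
        "query":"quick fox",
        "type":"phrase",
        "slop":10
      }
    }
  }
}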

Stumped, I reached out to a fellow blogger I met at an elasticsearch training who had already built an autocomplete feature. (Side note: the elasticsearch training was packed with information — well worth it, in my opinion.) He suggested that I index larger n-grams (min_gram=2 and max_gram=15) but skip converting the query into n-grams. So instead of relying on multiple bigram matches to add up to a good score, like this (matching terms are shown in angle brackets):

Query: j
Analyzed Query Terms: [j]
Document Terms: [ju] [um] [mp]

Query: ju
Analyzed Query Terms: <ju>
Document Terms: <ju> [um] [mp]

Query: jum
Analyzed Query Terms: <ju> <um>
Document Terms: <ju> <um> [mp]

Query: jump
Analyzed Query Terms: <ju> <um> <mp>
Document Terms: <ju> <um> <mp>

queries are just matched directly:

Query: j
Analyzed Query Terms: [j]
Document Terms: [ju] [um] [mp] [jum] [ump] [jump]

Query: ju
Analyzed Query Terms: <ju>
Document Terms: <ju> [um] [mp] [jum] [ump] [jump]

Query: jum
Analyzed Query Terms: <jum>
Document Terms: [ju] [um] [mp] <jum> [ump] [jump]

Query: jump
Analyzed Query Terms: <jump>
Document Terms: [ju] [um] [mp] [jum] [ump] <jump>

With this scheme, a phrase query isn’t required to keep word fragments in order, although if the autocomplete query contains multiple words, a phrase query may still be desirable to keep words in order. (See this stackoverflow post.)
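To keep the query terms from being broken into n-grams while the index still contains them, you can either configure a separate search analyzer in the mapping or override the analyzer at query time. Here’s a minimal sketch of the latter; the standard analyzer is just an example, and in practice you’d probably want an analyzer that applies the same lowercase/stop/stem steps as the autocomplete analyzer but without the ngram filter:

{
  "query":{
    "match":{
      "name.autocomplete":{
        "query":"jum",
        "analyzer":"standard"
      }
    }
  }
}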

For completeness, here’s the same analyzer configuration from before, but with custom settings on the ngram filter:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "autocomplete":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ] 
        }
      },
      "filter":{
        "ngram":{
          "type":"ngram",
          "min_gram":2,
          "max_gram":15
        }
      }
    }
  }
}
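Putting the settings and the mapping together, the body of the index-creation request (e.g., PUT /articles) would look roughly like this:

{
  "settings":{
    "analysis":{
      "analyzer":{
        "autocomplete":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
        }
      },
      "filter":{
        "ngram":{
          "type":"ngram",
          "min_gram":2,
          "max_gram":15
        }
      }
    }
  },
  "mappings":{
    "articles":{
      "properties":{
        "name":{
          "type":"multi_field",
          "fields":{
            "name":{
              "type":"string"
            },
            "autocomplete":{
              "analyzer":"autocomplete",
              "type":"string"
            }
          }
        }
      }
    }
  }
}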

In retrospect, I could probably have fixed the accuracy issue by increasing min_gram to cut down on false positives, but by the time I realized this I already had things working. My original approach might also allow using a smaller value of max_gram since long words do not have to fit in a single term. A smaller value of max_gram would create fewer tokens, which would make the index smaller. Depending on the size of the n-grams, that approach might also tolerate minor typos inside words, which my current solution does not. Something to experiment with in the future, I suppose.


15 thoughts on “Adding Autocomplete to an elasticsearch Search Application”

  1. Putting up the queries you used in this would make it a lot more readable.

    i.e.:

    “To fix these issues, I tried having the Match query generate a phrase query instead of a boolean query.”…

    Thanks

  2. Great Post! I’m new with ES and your posts about it are really great. I agree with the previous comment too, putting up the queries will help us so much.

    Thanks!

    • I haven’t used the new completion suggester yet, so I can’t comment definitively. From a quick glance, it seems that it only works on prefixes. For most use cases that is probably enough, but some use cases may require matching in the middle of words. I don’t know how much storage it uses, but I would guess that it uses less than the n-gram approach. Seems like it’s worth a try; you can always fall back to n-grams if the completion suggester doesn’t do what you need.

      • I played around with the completion suggester during an elasticsearch meetup (code at https://github.com/miku/es-hf-2014-05-28) – it is very fast, and you can define payloads at indexing time that you might want to pass along with the suggestions. For data sets on the order of a few million items (a few gigabytes), the (edge) n-gram approach doesn’t seem much slower, though.

  3. The “analysis” field is not supported under “settings” while creating an index. You can supply “mappings” under “settings”, but you have to supply “analysis” under “index”. This is confusing on Elasticsearch’s part; why would they separate the analysis part from the settings?
    http://stackoverflow.com/questions/17569938/analyzer-not-found-exception-while-creating-an-index-with-mapping-and-settings?rq=1
    This seems to work:
    {
      "index":{
        "analysis":{
          "analyzer":{
            "my_autocomplete":{
              "type":"custom",
              "tokenizer":"whitespace",
              "filter":[ "standard", "lowercase", "stop", "kstem", "ngram" ]
            }
          }
        }
      },
      "settings":{
        "mappings":{
          "articles":{
            "properties":{
              "name":{
                "type":"multi_field",
                "fields":{
                  "name":{
                    "type":"string"
                  },
                  "autocomplete":{
                    "analyzer":"my_autocomplete",
                    "type":"string"
                  }
                }
              }
            }
          }
        }
      }
    }

  4. If I set max_gram to 15 and the input term length is 30 characters for example, what will happen in this case? Will ES return any results?

    • In my example, I am still tokenizing, so an input of 30 characters should still match if it is made up of several words that are each 15 characters or less. A single term longer than 15 characters would not be expected to match.

  5. Very nice post. Did you have time to do more experiments with typo tolerance and n-grams? I am very interested in this problem. I am the CTO and co-founder of Algolia, and I have written a new search engine that does not rely on n-grams to perform suggestions.

    We have designed a new data structure that basically indexes only words and allows performing prefix and typo-tolerant queries very fast. This approach leads to an index about 5-10% bigger than if it contained only words, far less than indexing all n-grams. It also greatly reduces the cost of doing a fuzzy prefix match in the data structure using a Levenshtein distance (since there are fewer entries).

    The result is an engine with very little configuration: you just need to configure the order of importance of your object attributes and a custom sort if you have one.

    I would really appreciate it if you have some time to try it and give me feedback.

  6. All this makes sense except for how to order all the results that have “fox” in the field. As far as I can see, the Lucene search doesn’t take into account the position of the term in the field. For example, if you have 20 documents with search fields of equal length that all contain the word “quick”, but in some of them the word is at the beginning of the sentence and in others it is a little further back, the score for all documents will be the same, so they are displayed in order of insertion. In the real world, when I type “qui” I would expect results that start with that term to be in first place and other results below.
    This is especially true for user names, where some first names could start with ‘Jo’ (John) but the family name could also be ‘Johanson’, and when I start typing I would expect to see names starting with ‘Jo’, but not ‘Martin Johanson’ in the top results.

    To illustrate this with a sample: http://pastebin.com/ZC6vH7kb

    As you can see, the document with ID=3 is at the end, even though the documents and scores are equal.

    • Ideally, Lucene would consider the position of the queried term within the field when calculating the score. Since it doesn’t, you could apply a simple natural sort on that field in the search results if your application allows it.
