Testing Lucene Analyzers with elasticsearch

Up until now, I’ve been using elasticsearch as a way to speed up filtering, not as a search engine for human use. Using elasticsearch for filtering is relatively easy — all the inputs are normalized strings, dates, or numbers, documents either match or they don’t, and the order in which documents are returned is specified by the client.

My latest project is to build a search application for human use. Building search for humans is a lot harder. Instead of filtering by known fields with specific values, you have to match free text. Instead of a structured query, you get one text box. The result the user was searching for should appear in the first few results, and because Google is ubiquitous, users expect the results to be just as good.

Indexing and Analysis

Putting scoring aside for a moment, the first step is to find matches. Since humans don’t always use the same exact words to describe something, the words need to be massaged a bit to normalize case and remove suffixes such as -ed or -ing. This massaging is called analysis, and it is performed on both the documents being searched and the search query itself.

Analyzing a lot of documents takes time, so it is usually done up front. This process is called indexing. Analyzed documents are stored in a format that is efficient for searching, called an index.

Consider a document that contains the word “Searching”. An analyzer might lowercase the word and remove the -ing suffix, leaving just “search”. This analyzed term is what gets stored in the index. Later, a user might come along and search for the word “searched”. The query is similarly analyzed, yielding “search” as the search term. This term matches the previously-indexed document, so the document is returned to the user.

For all of this to work, the analyzer used during indexing and the analyzer used on the query must be compatible. If the analyzer used during indexing converts all words to uppercase but the analyzer used on the query converts all words to lowercase, there will never be a match!

Fortunately, you don’t have to code up any of this yourself — elasticsearch (and the Lucene library it uses under the hood) provides all of this functionality. But as with any tool, and especially a tool as deep as elasticsearch, you’ll be able to use it more effectively if you understand how it works.

Analyzers

The analysis process has three parts. To help illustrate the process, I’m going to use the following raw text as an example. (Processing an HTML or PDF document to obtain the title, main text, and other fields to be analyzed is called parsing, and it is beyond the scope of analysis. Let’s assume a parser has already extracted this text out of a larger document.)

<h1>Building a <em>top-notch</em> search engine</h1>

First, character filters pre-process the raw text to be analyzed. For example, the HTML Strip character filter removes HTML. We’re now left with this text:

Building a top-notch search engine
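
In elasticsearch, character filters are wired into a custom analyzer by name; the full settings format is covered later in this post. As a quick sketch here (the index and analyzer names are made up), the HTML Strip character filter plugs in like this:

curl -XPUT http://localhost:9200/html_test -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "html_analyzer":{
          "type":"custom",
          "char_filter":[ "html_strip" ],
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop" ]
        }
      }
    }
  }
}'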

Next, a tokenizer breaks up the pre-processed text into tokens. Tokens are usually words, but different tokenizers handle corner cases such as “top-notch” differently. Some tokenizers, such as the Standard tokenizer, consider dashes to be word boundaries, so “top-notch” would be two tokens (“top” and “notch”). Others, such as the Whitespace tokenizer, consider only whitespace to be a word boundary, so “top-notch” would remain a single token. There are also more unusual tokenizers, such as the NGram tokenizer, which generates tokens that are partial words. Assuming a dash is considered a word boundary, we now have:

[Building] [a] [top] [notch] [search] [engine]
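
If you want to check a tokenizer’s behavior yourself, the _analyze endpoint covered later in this post can build a one-off analyzer from just a tokenizer name. A quick sketch, assuming an elasticsearch node on localhost:9200 and the older query-parameter form of the API:

# The Standard tokenizer splits on the dash: [top] [notch]
curl 'http://localhost:9200/_analyze?tokenizer=standard&text=top-notch&pretty'

# The Whitespace tokenizer keeps the word intact: [top-notch]
curl 'http://localhost:9200/_analyze?tokenizer=whitespace&text=top-notch&pretty'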

Finally, token filters perform additional processing on tokens, such as removing suffixes (called stemming) and converting characters to lower case. The final sequence of tokens might end up looking like this:

[build] [a] [top] [notch] [search] [engine]

The combination of zero or more character filters, a tokenizer, and zero or more token filters makes up an analyzer. elasticsearch ships with a few analyzers built-in. The Standard analyzer, which consists of the Standard tokenizer and the Standard, Lowercase, and Stop token filters, is used by default.

Analyzers can do more complex manipulations to achieve better results. For example, an analyzer might use a token filter to spell check words or introduce synonyms, so that a search for either “saerch” or “find” also returns the document that contained “Searching”. There are also different implementations of similar features to choose from: elasticsearch ships with several different stemming algorithms, including Porter Stem, Snowball, and KStem.

In addition to choosing between the included analyzers, you can create your own custom analyzer by chaining together an existing tokenizer and zero or more filters. The Standard analyzer doesn’t do stemming, so you might want to create a custom analyzer that includes a stemming token filter.

With so many possibilities, you will want to test out different combinations and see what works best for your situation. Fortunately, elasticsearch makes configuring and testing analyzers relatively simple.

Configuring Analyzers

elasticsearch supports multiple indexes per server instance. Analyzers can be configured on a per-index basis — just include the analyzer configuration when you create the index. (The analyzer configuration didn’t seem to stick if I tried to update the settings on an existing test index by doing a PUT to /test/_settings. I’m not sure if this is a bug or known behavior. Update: you can add new analyzers, but the index must be closed first — see this post.) You can create and delete indexes without restarting elasticsearch.

The following example creates an index called test and configures a custom analyzer as the default for both indexing and searching. In this case, the custom analyzer is similar to the Standard analyzer, except it adds the KStem token filter.

curl -XPUT http://localhost:9200/test -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "default":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop", "kstem" ]
        }
      }
    }
  }
}'
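
If you later need to add another analyzer to this index after it already exists (per the update above), the index has to be closed while the settings change is applied, then reopened. A rough sketch, with a made-up analyzer name:

curl -XPOST http://localhost:9200/test/_close

curl -XPUT http://localhost:9200/test/_settings -d '{
  "analysis":{
    "analyzer":{
      "my_new_analyzer":{
        "type":"custom",
        "tokenizer":"standard",
        "filter":[ "standard", "lowercase", "kstem" ]
      }
    }
  }
}'

curl -XPOST http://localhost:9200/test/_open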

When you’re done with this index, you can delete it just as easily.

curl -XDELETE http://localhost:9200/test

This is an API call, so there’s no warning prompt of any kind. Make sure you type the name of the index correctly, and don’t test on a production instance of elasticsearch!

Testing Analyzers

Normally, analysis is an opaque process — text is analyzed and immediately stored as terms in the index or used in a query. Fortunately, elasticsearch provides a helpful endpoint that allows you to run some sample text through an analyzer and inspect the resulting tokens.

Let’s try analyzing a word with a suffix to make sure KStem is correctly configured. (The pretty parameter just tells elasticsearch to format the JSON in a human-readable way. Don’t use this parameter in production.)

curl 'http://localhost:9200/test/_analyze?text=Searched&pretty'
{
  "tokens" : [ {
    "token" : "search",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

The -ed suffix was removed, so it looks like it’s working!

You can also choose a specific analyzer with the analyzer parameter, which lets you configure and test a number of different analyzers at once. Just give each one a name other than default in the configuration and reference that name in the analyzer parameter.
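
For example, here’s a sketch (the analyzer names are made up) that configures two stemming variants side by side in a new index and runs the same text through each:

curl -XPUT http://localhost:9200/test2 -d '{
  "settings":{
    "analysis":{
      "analyzer":{
        "with_kstem":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop", "kstem" ]
        },
        "with_snowball":{
          "type":"custom",
          "tokenizer":"standard",
          "filter":[ "standard", "lowercase", "stop", "snowball" ]
        }
      }
    }
  }
}'

curl 'http://localhost:9200/test2/_analyze?analyzer=with_kstem&text=Searched&pretty'
curl 'http://localhost:9200/test2/_analyze?analyzer=with_snowball&text=Searched&pretty'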

The Standard analyzer is one of the built-in ones, so let’s see what it does with the same text:

curl 'http://localhost:9200/test/_analyze?analyzer=standard&text=Searched&pretty'
{
  "tokens" : [ {
    "token" : "searched",
    "start_offset" : 0,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 1
  } ]
}

As expected, the Standard analyzer didn’t do any stemming, but it did lowercase the word.
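
Once the analysis looks right, a quick end-to-end sanity check is to index a document and then search for a different inflection of the same word. A minimal sketch against the test index created above (the type and field names are made up):

# Index a document containing "Searching"; refresh=true makes it immediately searchable
curl -XPUT 'http://localhost:9200/test/doc/1?refresh=true' -d '{ "body":"Searching" }'

# Search with a different inflection; both sides analyze down to "search", so the document should match
curl 'http://localhost:9200/test/doc/_search?q=body:searched&pretty'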

Of course, the true test of an analyzer is how well it performs on your documents. I’m still working on indexing all the documents for my project, but I hope to blog more on the topic soon!


3 thoughts on “Testing Lucene Analyzers with elasticsearch”

  1. I’m working through a lot of the same stuff you are right now. Your post provided a good framework to build from, which I’m going to leverage. I hope to have a good analysis for my data (which is writers for Contently), but not sure how much time I’ll have. I’m looking forward to your results…

  2. Thanks for writing up this article! I especially enjoyed the step by step guide to debugging and testing analyzers from curl. It’s all information available from the elasticsearch reference site, but this puts it together in a much easier to follow form.

    Cheers

  3. Thanks for posting this, even 2 years later it is useful. It doesn’t seem to work anymore to close the index before making changes to/adding analyzers … you must delete the index, at least on my 1.3.3 install.
