elasticsearch Index Templates

I’ve been working on setting up an elasticsearch cluster for logstash. Since logstash has unique write throughput and storage requirements, there are a few recommended index settings for logstash — see this wiki page and this blog post.

By default, logstash creates a new index for each day’s logs, so these index settings have to be configured using an index template. If an index is configured directly, the settings would only apply to the current day’s index and tomorrow’s index would be created with the default settings again. An index template applies to all new indexes that match a pattern such as logstash-*, which will match logstash-2013.03.18, logstash-2013.03.19, etc.
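
As a rough sketch of the API route, the script below writes a template body to a file and, only if a cluster address is supplied, registers it. The setting values are placeholders rather than the recommended logstash settings, `ES_URL` is a hypothetical environment variable, and the `"template"` pattern key is the form used by elasticsearch releases of this era (newer releases name the field differently).

```shell
#!/bin/sh
# Sketch only: the setting values below are illustrative placeholders.
template_file="${TMPDIR:-/tmp}/logstash_template.json"

# The "template" key holds the index-name pattern the settings apply to.
cat > "$template_file" <<'EOF'
{
  "template": "logstash-*",
  "settings": {
    "number_of_shards": 5,
    "index.refresh_interval": "5s"
  }
}
EOF

# ES_URL is a hypothetical variable (e.g. http://localhost:9200).
# Nothing is sent unless it is set.
if [ -n "${ES_URL:-}" ]; then
    curl -XPUT "$ES_URL/_template/logstash_template" -d @"$template_file"
fi

echo "wrote $template_file"
```

Once a template like this is registered, each new daily index whose name matches logstash-* should pick up the settings when it is created.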

As with most settings in elasticsearch, there are two ways to configure index templates. They can be configured through the API, or they can be stored in a configuration file. The latter is helpful when configuring a cluster that is not up and running. In my case, I am using chef to configure the elasticsearch nodes, so it’s not guaranteed that the cluster is up when the recipe executes.

Unfortunately, it took me a long time to figure out how to get the configuration file method working. As this thread suggests, I put the file in the right place — #{config.path}/templates/logstash_template.json — and I made sure to configure each master-eligible node. I even read through the feature and the associated commit to make sure the documentation was in sync with the code. elasticsearch just wasn’t picking up the settings.


elasticsearch EC2 Discovery

On a private network, elasticsearch nodes will automatically discover peers using multicast. Nodes configured with a common cluster name will magically find each other when they boot up and form a cluster. It’s wonderful, magical, and a little scary — elasticsearch nodes will likely be the first to become sentient in a robot uprising.

On AWS and most other clouds, multicast is not allowed. (Rackspace supports broadcast and multicast.) This leaves two options: use unicast discovery and explicitly list out each node in discovery.zen.ping.unicast.hosts, or use the EC2 discovery method provided by the cloud-aws plugin. The former is fairly brittle due to the dynamic nature of the cloud. The latter uses the EC2 API to enumerate hosts, essentially populating discovery.zen.ping.unicast.hosts dynamically. This guide does a great job of covering the process, so I won’t go into the details here. Instead, I will try to offer a few tips on the setup process.


Adding Autocomplete to an elasticsearch Search Application

A commonly requested feature in search applications is autocomplete or search suggestions. The basic idea is to give users instant feedback as they’re typing. Implementations of this feature vary: sometimes the suggestions are based on popular queries (e.g., Google’s Autocomplete), other times they may be previews of results (e.g., Google Instant). The suggestions may be relatively simple, or they can be extremely complex, taking into account things like the user’s search history, generally popular queries, top results, trending topics, spelling mistakes, etc. Building the latter can consume the resources of a company the size of Google, but it’s relatively easy to add simple results-based autocomplete to an existing elasticsearch search application.


Advanced Scoring in elasticsearch

In my previous post about elasticsearch, I explained how the built-in Lucene scoring algorithm works. I also briefly mentioned the possibility of assigning boosts to different document fields or query terms to influence the scoring algorithm. In this post I will cover boosting in greater detail.

Why Boost?

The first question I had when I started working with scoring was: why do I need to boost at all? Isn’t Lucene’s scoring algorithm tried and true? It seemed to work pretty well on the handful of test documents that I put in the index. But as I added more and more documents, the results got worse, and I realized why boosting is necessary. Lucene’s scoring algorithm does great in the general case, but it doesn’t know anything about your specific subject domain or application. Boosting allows you to compensate for different document types, add domain-specific logic, and incorporate additional signals.


view on CentOS

When editing a file, it’s often handy to have a related file open for reference. For example, if I’m using a library function I might have the function definition open in a separate window. Or if I’m editing a configuration file on one server, I might want to see what the same configuration file looks like on a different server.

Vim has a read-only mode (vim -R) that is perfect for this purpose. From the vim man page:

Read-only mode. The ‘readonly’ option will be set. You can still edit the buffer, but will be prevented from accidently overwriting a file. If you do want to overwrite a file, add an exclamation mark to the Ex command, as in :w!. The -R option also implies the -n option.

The -n option stands for “no swap file”: since you will not be editing the file, there is no reason to create a swap file. This is helpful because it allows you to open the same file in read-only mode multiple times and vim will not complain.

Instead of typing vim -R, I typically invoke vim using the view command, which does the same thing:

Vim behaves differently, depending on the name of the command (the executable may still be the same file).

vim: The “normal” way, everything is default.

view: Start in read-only mode. You will be protected from writing the files. Can also be done with the -R argument.

On Ubuntu, vi, vim, and view are all symlinks managed by the alternatives system. When the vim package is installed, all three commands point at the real binary, /usr/bin/vim.basic. Since all three commands invoke the same binary, the behavior is consistent across commands (except view invokes vim in read-only mode, as expected).

On CentOS, the situation is more complicated. The vim-minimal package provides /bin/vi and /bin/view, and the latter is just a symlink to the former. The vim-enhanced package provides /usr/bin/vim, then sets up a bash alias vi=vim in /etc/profile.d/vim.sh. Unfortunately, there is no special treatment for the view command, so when you invoke vi or vim you get the enhanced vim, but when you invoke view you get the minimal vim. This can be annoying — for example, if you open a file using vi or vim you will get syntax highlighting, but if you open a file using view, you won’t.
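
To see how these commands resolve on a particular machine, a small script along these lines may help. It is only a sketch: `resolve_cmd` is a hypothetical helper name, and `readlink -f` assumes GNU coreutils. Because it runs as a non-interactive script, shell aliases such as vi=vim are not in effect, so it reports the binaries on the PATH.

```shell
#!/bin/sh
# Print where a command lives and what that path ultimately points at.
resolve_cmd() {
    path=$(command -v "$1" 2>/dev/null)
    if [ -z "$path" ]; then
        echo "$1: not found"
        return 1
    fi
    # readlink -f follows the whole symlink chain (GNU coreutils).
    echo "$1: $path -> $(readlink -f "$path")"
}

for cmd in vi vim view; do
    resolve_cmd "$cmd" || true
done
```

If view and vim end up at different binaries, that matches the mismatch described above.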

Fortunately, the fix is fairly simple. Just add the following lines to your .bashrc:

# Only define the alias if the enhanced vim is installed
if [ -x /usr/bin/vim ]; then
    alias view='vim -R'
fi

This sets up an alias for the view command, so vi, vim, and view all invoke the same binary and the behavior is consistent across all three.