In my previous post about elasticsearch, I explained how the built-in Lucene scoring algorithm works. I also briefly mentioned the possibility of assigning boosts to different document fields or query terms to influence the scoring algorithm. In this post I will cover boosting in greater detail.
The first question I had when I started working with scoring was: why do I need to boost at all? Isn’t Lucene’s scoring algorithm tried and true? It seemed to work pretty well on the handful of test documents that I put in the index. But as I added more and more documents, the results got worse, and I realized why boosting is necessary. Lucene’s scoring algorithm does great in the general case, but it doesn’t know anything about your specific subject domain or application. Boosting allows you to compensate for different document types, add domain-specific logic, and incorporate additional signals.
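To make the idea concrete, here is a minimal sketch of what per-field boosting can look like in a query body. It assumes a `multi_match` query using the `field^boost` syntax. The field names (`title`, `tags`, `body`), the boost values, and the helper `boosted_query` are made up for illustration and are not taken from my actual index.

```python
import json


def boosted_query(text, field_boosts):
    """Build a multi_match query body with a boost per field.

    field_boosts maps a (hypothetical) field name to its boost factor,
    e.g. {"title": 3} becomes the field spec "title^3".
    """
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Assumed weighting: a match in the title counts more than one in the body.
body = boosted_query("lucene scoring", {"title": 3, "tags": 2, "body": 1})
print(json.dumps(body, indent=2))
```

The resulting dictionary is only the request body; how you send it to the search endpoint depends on your client. The right boost values are domain-specific and usually found by trial and error against real queries.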
At work, I’ve been building a search application using elasticsearch. In my first post on the topic, I talked about indexing and analysis and ignored scoring entirely. Scoring is a very complex topic, but I’ll try to address some of the basics in this post. I’ll also cover a specific scoring-related issue with elasticsearch’s
Up until now, I’ve been using elasticsearch as a way to speed up filtering, not as a search engine for human use. Using elasticsearch for filtering is relatively easy: all the inputs are normalized strings, dates, or numbers; documents either match or they don’t; and the order in which documents are returned is specified by the client.
My latest project is to build a search application for human use. Building search for humans is a lot harder. Instead of filtering by known fields with specific values, you have to match free text. Instead of a structured query, you get one text box. The document the user is looking for should appear among the first few results, and because Google is ubiquitous, users expect the results to be just as good as Google’s.