Advanced Scoring in elasticsearch

In my previous post about elasticsearch, I explained how the built-in Lucene scoring algorithm works. I also briefly mentioned the possibility of assigning boosts to different document fields or query terms to influence the scoring algorithm. In this post I will cover boosting in greater detail.

Why Boost?

The first question I had when I started working with scoring was: why do I need to boost at all? Isn’t Lucene’s scoring algorithm tried and true? It seemed to work pretty well on the handful of test documents that I put in the index. But as I added more and more documents, the results got worse, and I realized why boosting is necessary. Lucene’s scoring algorithm does great in the general case, but it doesn’t know anything about your specific subject domain or application. Boosting allows you to compensate for different document types, add domain-specific logic, and incorporate additional signals.

Before I can give specific examples, I need to explain a little bit about the search application I’ve been working on. The application powers site search for IGN, a site about “gaming, entertainment, and everything guys enjoy.” We currently index four main types of content from our backend APIs: articles, videos, wiki pages, and “objects” (games, movies, shows, etc.). By default, search results of all types are returned in a single aggregate listing.

Compensating for Different Document Types — Lucene’s scoring algorithm works very well if your documents are homogeneous. But if you have different document types, you may need to make some manual adjustments. For example, we index both articles and videos. Articles have a lot of textual content — the entire body of the article — but videos only have a short description field. By default Lucene will prefer a match in a shorter field, so when videos match they tend to score higher than articles.
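To make the short-field preference concrete, here is a minimal sketch of the field-length normalization used by Lucene's classic TF-IDF similarity, where the norm is 1 / sqrt(number of terms in the field). The term counts below (25 for a video description, 900 for an article body) are made-up values for illustration, not measurements from our index. The real score also depends on term frequency, inverse document frequency, and a lossy encoding of the norm, all of which are ignored here.

```python
import math


def length_norm(num_terms: int) -> float:
    """Field-length norm from Lucene's classic similarity: 1 / sqrt(numTerms)."""
    return 1.0 / math.sqrt(num_terms)


# Hypothetical field lengths, chosen only for illustration.
video_description_terms = 25
article_body_terms = 900

video_norm = length_norm(video_description_terms)
article_norm = length_norm(article_body_terms)

# With every other scoring factor equal, this ratio is how much more
# weight a match in the short video description receives.
ratio = video_norm / article_norm
print(f"video norm={video_norm:.4f}  article norm={article_norm:.4f}  ratio={ratio:.1f}")
```

Under these assumed lengths, the same single-term match counts several times more in the video than in the article, which is the imbalance the article boost is meant to offset.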

Since elasticsearch supports searching across multiple indexes, it may have been possible to compensate for different document types by creating a separate index for each type and performing searches with a multi-index query. I haven’t tested it, but I think the scores from each index would be normalized by the query normalization factor (queryNorm), so a top-scoring video would be given about the same weight as a top-scoring article. However, this approach would also calculate term statistics such as document frequency for each content type individually, and I’m not sure how that would affect the results. Giving articles a small boost was a much simpler solution, especially since we already wanted to control how important each content type was for separate reasons.

Adding Domain-specific Logic — Sometimes you have domain-specific logic that is difficult for Lucene to discern on its own. For example, our review articles are probably one of the most important types of content on our site. Since our users are often looking for our reviews, we gave review articles a small boost so they would score higher than other articles.

Another example is stub wiki pages. Videos and objects are expected to have relatively short text descriptions. Articles are often longer, although sometimes we’ll have short articles that announce a bit of news or promote some other content, so a short article is okay. However, a short wiki page is often a sign of a stub, so it should score lower than other results. This is the opposite of what Lucene would have done on its own: Lucene would have preferred a match in the shorter wiki page and scored it higher.

Incorporating Additional Signals — For the most part, the importance of a particular piece of content on our site fades with time. For example, a review for a game that was just released may be important this week, but less so next month and even less so a year from now. Out of the box, Lucene does not consider the recency/freshness of content in its scoring algorithm. But if recency factors heavily into scoring in your domain, you may want to incorporate it using a boost. (More details on how to implement a recency boost can be found later in this post.)

We boost our game, movie, and TV show objects if we have written/created more articles and videos about them. A more generic example of this might be boosting products that have been purchased more, or boosting articles that have more views or comments. Which attributes suggest importance is very domain-specific, so you have to handle it yourself with a boost.

Boosting at Index Time vs. Query Time

Boosts can be applied at index time when the document is indexed, or at query time when a search is performed. If a particular document will always be more important than others, you may want to consider applying the boost when the document is indexed. Pre-boosted documents are faster to search because there is less work to do when a search is performed. However, even if you know that a document will always be more important, you may not know how much more important. If the boost is too strong, the important document will always appear at the top of results (as long as it matched at all); if the boost is too weak, the important document will not get any real advantage over other documents.

Applying a boost at index time requires that you re-index the document to change the boost. Unless you are manually adding documents to your index and deciding the boost on a case-by-case basis, you likely have some kind of script or program that is building your index and the boosts are determined by some logic or a set of rules. A change in the logic or rules will likely affect many documents, so you will effectively need to rebuild your index for the changes to take effect. If your index is small, this may be appropriate. Our index takes several hours to rebuild, so we avoid applying boosts at index time whenever possible. Applying boosts at query time lets us add new boosts, change boost criteria, and change boost strength on the fly. This flexibility is well worth the additional runtime cost.

Although you can combine boosts applied at index time and at query time, some boosts must be applied at query time because there isn’t enough information at index time to calculate the boost. For example, if you are doing a boost based on document freshness (how close the document’s timestamp is to the current time), the current time (when the search is performed) is not known at index time. In this particular example, you could use the time the document is indexed as the current time if the index is frequently rebuilt and you really want to avoid query time boosts.

Implementing Boosts

Almost every elasticsearch query type has a boost parameter that allows you to influence the weight of that query versus other queries, but we don’t use this parameter because we only have one main query. The main query is a query_string query, which parses the user’s query, finds matches, and scores them using Lucene’s default scoring algorithm. Then we apply a number of boosts depending on whether certain criteria are met.

In early prototypes, I accomplished the boosting by wrapping the main query in a custom_score query. As the name implies, the custom_score query allows you to calculate the score of each document using custom logic by passing in a script in the script parameter. By default, scripts are interpreted as MVEL, although other languages are supported. You can access the score assigned by the wrapped query via the special _score variable, so I started off doing something like this:

{
  "query": {
    "custom_score": {
      "query": { ...the main query... },
      "script": "_score * (doc['class'].value == 'review' ? 1.2 : 1)"
    }
  }
}

This worked, but it wasn’t very scalable. As I added more boosts, I would end up with an expression with dozens of terms. Also, the values of every field the script reads through doc[...] have to be loaded into memory so that the script can evaluate them for each hit, which consumes memory and is relatively slow. Fortunately, there is a much better tool for the job: the custom_filters_score query. From the documentation:

A custom_filters_score query allows to execute a query, and if the hit matches a provided filter (ordered), use either a boost or a script associated with it to compute the score.

This can considerably simplify and increase performance for parameterized based scoring since filters are easily cached for faster performance, and boosting / script is considerably simpler.

Converted to a custom_filters_score query, the above example looks like this:

{
  "query": {
    "custom_filters_score": {
      "query": { ...the main query... },
      "filters": [
        {
          "filter": {
            "term": {
              "class": "review"
            }
          },
          "boost": 1.2
        }
      ]
    }
  }
}

If you want to add additional boosts, just add another filter specifying the criteria and assigning it a boost. You can use any filter, including filters that wrap other filters like the and filter. If you have multiple filters, you may want to specify how multiple matching filters will be combined by passing the score_mode parameter. By default, the first matching filter’s boost is used, but if you have multiple filters that may match you can set score_mode to something like multiply which would apply all the boosts.
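To illustrate the difference between the two score_mode settings mentioned above, here is a small simulation in Python. It is only a sketch of the combining rule as described in this post, not elasticsearch's actual implementation. The function name combine_boosts and the sample boost values are hypothetical, and it assumes the score is left unchanged when no filter matches.

```python
import math


def combine_boosts(matching_boosts, score_mode="first"):
    """Simulate how the boosts of the filters that matched a hit are combined.

    matching_boosts: boosts of the matching filters, in the order the
    filters are listed in the query. Assumes a factor of 1.0 (no change)
    when no filter matched.
    """
    if not matching_boosts:
        return 1.0
    if score_mode == "first":
        # Default behavior: only the first matching filter counts.
        return matching_boosts[0]
    if score_mode == "multiply":
        # Every matching filter contributes its boost.
        return math.prod(matching_boosts)
    raise ValueError(f"score_mode not covered by this sketch: {score_mode}")


# A hit that matches two hypothetical filters with a boost of 1.2 each.
boosts = [1.2, 1.2]
first_factor = combine_boosts(boosts, "first")
multiply_factor = combine_boosts(boosts, "multiply")
print(f"first: {first_factor:.2f}  multiply: {multiply_factor:.2f}")
```

With the default mode the second boost is ignored entirely, so the order of the filters matters. With multiply the order is irrelevant, but the boosts compound.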

The following query boosts reviews by 20%, boosts articles by 20% (so review articles would be boosted 44%, since 1.2 × 1.2 = 1.44), and penalizes wiki pages that are 600 characters or shorter by 80%.

{
  "query": {
    "custom_filters_score": {
      "query": { ...the main query... },
      "filters": [
        {
          "filter": {
            "term": {
              "class": "review"
            }
          },
          "boost": 1.2
        },
        {
          "filter": {
            "term": {
              "type": "article"
            }
          },
          "boost": 1.2
        },
        {
          "filter": {
            "and": [
              {
                "term": {
                  "type": "page"
                }
              },
              {
                "range": {
                  "descriptionLength": {
                    "to": 600
                  }
                }
              }
            ]
          },
          "boost": 0.2
        }
      ],
      "score_mode": "multiply"
    }
  }
}
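Since the list of filters tends to grow, it can help to generate the request body from a list of rules instead of editing JSON by hand. The following is a minimal sketch in Python that produces the same structure as the query above. The helper names boost_filter and build_boosted_query and the sample search text are assumptions made for illustration, not part of elasticsearch or of our application.

```python
import json


def boost_filter(filter_clause: dict, boost: float) -> dict:
    """Pair a filter with the boost applied to hits that match it."""
    return {"filter": filter_clause, "boost": boost}


def build_boosted_query(main_query: dict, boost_filters, score_mode="multiply") -> dict:
    """Wrap the main query in a custom_filters_score query."""
    return {
        "query": {
            "custom_filters_score": {
                "query": main_query,
                "filters": list(boost_filters),
                "score_mode": score_mode,
            }
        }
    }


# Hypothetical user input for the main query_string query.
main_query = {"query_string": {"query": "halo review"}}

filters = [
    boost_filter({"term": {"class": "review"}}, 1.2),
    boost_filter({"term": {"type": "article"}}, 1.2),
    boost_filter(
        {"and": [
            {"term": {"type": "page"}},
            {"range": {"descriptionLength": {"to": 600}}},
        ]},
        0.2,
    ),
]

body = build_boosted_query(main_query, filters)
print(json.dumps(body, indent=2))
```

Adding or retuning a boost then becomes a one-line change to the filters list, which suits the on-the-fly adjustments that query-time boosting makes possible.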

Variable Boosts

Sometimes you want to adjust the strength of a boost based on a field in the document. For example, if you want to boost recent documents, an article published today should be boosted more than an article published yesterday, and an article published yesterday should be boosted more than an article published last week, etc. Even though filters are cached and run relatively quickly, it would be impractical to have a filter for articles published today, another filter for articles published yesterday, another filter for articles published last week, etc. Fortunately, the custom_filters_score query can accept a script parameter instead of a boost for these situations.

In his presentation, Boosting Documents in Solr by Recency, Popularity and Personal Preferences, Timothy Potter talks about using Solr’s recip function to calculate the boost value for recent documents. Unfortunately, elasticsearch doesn’t have a recip function, but you can easily implement the same underlying function, y = a / (m * x + b), and pass it to elasticsearch as a script.

In the example below, I’m using the values of m, a, and b suggested on slide 7: m=3.16E-11, a=0.08, and b=0.05. Since some documents in our index have dates in the future, I added abs() to take the absolute value of the difference between the time the query is run and the document’s timestamp. I’m also adding 1.0 to the boost value to make it a freshness boost instead of a staleness penalty.

{
  "query": {
    "custom_filters_score": {
      "query": { ...the main query... },
      "params": {
        "now": ...current time when query is run, expressed as milliseconds since the epoch...
      },
      "filters": [
        {
          "filter": {
            "exists": {
              "field": "date"
            }
          },
          "script": "(0.08 / ((3.16*pow(10,-11)) * abs(now - doc['date'].date.getMillis()) + 0.05)) + 1.0"
        }
      ]
    }
  }
}

With these values, documents dated right now are boosted up to 160% (boost value is 2.6). This falls off to a 100% boost after 10 days, 60% after a month, 15% after 6 months, 8% after a year, and less than 4% after 2 years. (See this graph.)
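If you want to check these numbers or try other values of m, a, and b before touching the query, the same formula is easy to reproduce offline. This is a small Python sketch of the script's arithmetic. The function name recency_boost is my own, and the month, half-year, and year lengths are approximated as 30, 182.5, and 365 days.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recency_boost(age_ms: float, m: float = 3.16e-11, a: float = 0.08, b: float = 0.05) -> float:
    """Same arithmetic as the script: a / (m * |age| + b) + 1.0."""
    return a / (m * abs(age_ms) + b) + 1.0


# Approximate ages in days for the checkpoints mentioned in the text.
checkpoints = [
    ("now", 0),
    ("10 days", 10),
    ("1 month", 30),
    ("6 months", 182.5),
    ("1 year", 365),
    ("2 years", 730),
]

for label, days in checkpoints:
    boost = recency_boost(days * MS_PER_DAY)
    print(f"{label:>9}: boost value {boost:.3f} ({(boost - 1.0) * 100:.1f}% boost)")
```

Because of abs(), a document dated the same distance in the future receives the same boost as one dated in the past.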

Final note — elasticsearch scripts are cached for faster execution. When using scripts with elasticsearch, pass in values that change from query to query as a parameter via the params parameter rather than doing string interpolation in the script itself. This way, the script stays constant and cacheable, but your parameter still changes with every query.
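As a rough illustration of that advice, the sketch below builds the recency query in Python so that the script text is a constant and only the now parameter changes between requests. The function name build_recency_query and the sample search text are hypothetical, and sending the body to elasticsearch is left out.

```python
import time

# The script text never changes between requests, so it stays cacheable.
RECENCY_SCRIPT = (
    "(0.08 / ((3.16*pow(10,-11)) * "
    "abs(now - doc['date'].date.getMillis()) + 0.05)) + 1.0"
)


def build_recency_query(main_query: dict, now_ms=None) -> dict:
    """Build the request body, passing the current time through params."""
    if now_ms is None:
        # Milliseconds since the epoch at the moment the query is built.
        now_ms = int(time.time() * 1000)
    return {
        "query": {
            "custom_filters_score": {
                "query": main_query,
                "params": {"now": now_ms},
                "filters": [
                    {
                        "filter": {"exists": {"field": "date"}},
                        "script": RECENCY_SCRIPT,
                    }
                ],
            }
        }
    }


# Two requests at different (fixed, made-up) times share the same script text.
main_query = {"query_string": {"query": "zelda"}}  # hypothetical user input
first = build_recency_query(main_query, now_ms=1_000)
second = build_recency_query(main_query, now_ms=2_000)

first_cfs = first["query"]["custom_filters_score"]
second_cfs = second["query"]["custom_filters_score"]
print(first_cfs["filters"][0]["script"] == second_cfs["filters"][0]["script"])
print(first_cfs["params"], second_cfs["params"])
```

Interpolating the timestamp into the script string instead would produce a different script on every request, which defeats the cache.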

12 thoughts on “Advanced Scoring in elasticsearch”

  1. Really nice post, it helped understand boosting, scores and the last code about boosting recent items was incredibly helpful!

  2. Thanks for posting this! Just switched from Solr to ElasticSearch and was looking for a good explanation on how to boost documents by date in ES, so this definitely hit the spot. Thanks again!

  3. Great post.
    I will try query-time boosts like you do instead of index-time boosts. Every time I was concerned about ES performance, I hit a useless premature-optimization wall: there should be no real impact on performance :-)

  4. I’m running into an issue with the main query part. When I include post_date as one of my fields and search using a string, I get this error:

    "error" : "ElasticSearchException[Couldn't parse query from source.]; nested: ElasticSearchParseException[failed to parse date field [link], tried both date format [YYYY-MM-dd HH:mm:ss], and timestamp number]; nested: IllegalArgumentException[Invalid format: \"link\"]; ",

    I can see that elasticsearch cannot convert the string into a date.

    When I remove post_date from the fields, it seems like it’s no longer available to the script, and I get this error:

    "error" : "CompileException[[Error: No field found for [org.elasticsearch.index.fielddata.ScriptDocValues$Longs@33de6c02] in mapping with types [post]]\n[Near : {... *pow(10,-11)) * abs(now - doc[post_date].date.getM ….}]\n ^\n[Line: 1, Column: 35]]; nested: ElasticSearchIllegalArgumentException[No field found for [org.elasticsearch.index.fielddata.ScriptDocValues$Longs@33de6c02] in mapping with types [post]]; ",

    Here is the gist to the curl I’m using:
    https://gist.github.com/greatwitenorth/e62034b990fbaba0863e

    any ideas?

    • The first error sounds like elasticsearch is encountering the string “link” in the date field. Can you retrieve a document from your index and verify what’s in the date field?

      I’m not sure about the second error — if the field doesn’t exist in the document the script shouldn’t be run, so I don’t know why it’s complaining “no field found”. Maybe I’m misinterpreting the error.

      • Well, I finally figured out the issue. I was wrapping post_date in single quotes, and my curl command was also using single quotes to wrap the whole request. I changed them to double quotes and escaped them, and it works fine now.

  5. Hi, I’m looking for a way to boost a term that appears in the first position.
    Ex: when searching “abc”, the record “abc def” should score higher than “012 abc def”.
    Is there any way to boost it at index time (similarity?)?
    Or by using a query other than span queries, which are quite slow in my case?

  6. Great post, thanks!
    Minor question: what’s the use of the abs() function? Do you really want to give the same boost to a document published yesterday and to a document whose publication date is tomorrow? Did you mean max(0,x) instead of abs(x)? Or maybe it doesn’t matter if you filter for documents with date<=now anyway?

    • I used abs() because in this case there were some documents (games) that had release dates in the future. If a game was announced for release 9 months from now, it shouldn’t be boosted very much, but if the release is next week, it should be boosted. Same goes for a game that was released 9 months ago.
