Lucene Scoring and elasticsearch’s _all Field

At work, I’ve been building a search application using elasticsearch. In my first post on the topic, I talked about indexing and analysis and ignored scoring entirely. Scoring is a very complex topic, but I’ll try to address some of the basics in this post. I’ll also cover a specific scoring-related issue with elasticsearch’s _all field.

Scoring

elasticsearch is powered by the Lucene search library. When you send Lucene a query, it first finds all the documents that match the query. Then, it assigns a score to each of the matching documents. More technically, Lucene’s documentation states:

Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval – documents “approved” by BM are scored by VSM.

A higher score indicates that the document is more relevant to the query. By default, results are ordered from highest score to lowest score.

Lucene’s Scoring Algorithm

The scoring algorithm is quite complex. It is described in detail in Lucene’s documentation. If you ignore all the technical jargon and formulas, these are the general rules:

  • Documents that have more occurrences of a given term receive a higher score
  • Rarer terms contribute more to the total score
  • A document that contains more of the query’s terms will receive a higher score than another document with fewer query terms
  • Shorter fields contribute more to the score
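These rules map roughly onto the factors of Lucene's classic TF-IDF similarity. The Python sketch below is a simplified illustration rather than Lucene's actual implementation: the function names and the sample numbers are made up for this example, and it leaves out query normalization, boosts, and the lossy single-byte encoding Lucene applies to norms.

```python
import math


def tf(freq):
    """Term frequency: more occurrences score higher, with diminishing returns."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: terms found in fewer documents score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def coord(matched_terms, query_terms):
    """Coordination: the fraction of the query's terms that the document contains."""
    return matched_terms / query_terms


def length_norm(num_terms):
    """Length normalization: a match in a shorter field counts for more."""
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, num_docs, doc_freq, field_length):
    """Simplified contribution of one query term in one field."""
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * length_norm(field_length)


# Hypothetical numbers: an index of 100 documents. Each comparison changes
# exactly one input, so it isolates one of the four rules.
more_occurrences = term_score(4, 100, 10, 20) > term_score(1, 100, 10, 20)
rarer_term = term_score(1, 100, 1, 20) > term_score(1, 100, 50, 20)
more_query_terms = coord(2, 3) > coord(1, 3)
shorter_field = term_score(1, 100, 10, 5) > term_score(1, 100, 10, 50)
print(more_occurrences, rarer_term, more_query_terms, shorter_field)
```

Each of the four comparisons evaluates to True, one per rule in the list.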

Boosting

In addition to these rules, you can assign boosts to different document fields or query terms to influence the scoring algorithm. For example, the documents I’m working with have a primary name field and some secondary name fields. If I want a document that matches the primary name field to rank higher than a document that matches a secondary name field, I can assign a boost to the primary name field to make it count more toward the overall score. This can be done either at index time or at query time. (I cover boosting in more detail in this post.)
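As a rough illustration of a query-time boost, the query_string query accepts a `^` suffix on a field name. The Python sketch below only builds and prints a request body. The field names `primary_name` and `secondary_name`, the search text, and the boost value of 2 are placeholders invented for this example, not the actual names from my index.

```python
import json


def boosted_query(text, boosts):
    """Build a query_string request body with per-field query-time boosts.

    `boosts` maps a field name to its boost; a boost of 1 is left implicit.
    """
    fields = [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]
    return {"query": {"query_string": {"query": text, "fields": fields}}}


# Hypothetical field names: matches in primary_name count twice as much.
body = boosted_query("foobar", {"primary_name": 2, "secondary_name": 1})
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of a `_search` request, in the same way as the curl examples later in this post.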

elasticsearch’s _all Field

By default, every elasticsearch document includes an _all field. If a search query does not specify what field(s) to query, the query will be run against the _all field.

The idea of the _all field is that it includes the text of one or more other fields within the document indexed. It can come very handy especially for search requests, where we want to execute a search query against the content of a document, without knowing which fields to search on.

Further,

One of the nice features of the _all field is that it takes into account specific fields boost levels. Meaning that if a title field is boosted more than content, the title (part) in the _all field will mean more than the content (part) in the _all field.

Sounds great, right? Unfortunately, there are a few caveats.

Index Size: The _all field is just a regular field that happens to include the text of every other field in the document, so leaving it enabled increases the size of the index. In one of the indexes I’ve been working with, disabling the _all field reduced the size by 30%. (Note that the _all field copies the text from the other fields and analyzes it again; it doesn’t copy the pre-analyzed tokens. You can set a separate analyzer for the _all field.)

Highlighting: The _all field is also useless for highlighting search results (showing snippets of the document that matched the search query). From the _all field documentation:

For any field to allow highlighting it has to be either stored or part of the _source field. By default _all field does not qualify for either, so highlighting for it does not yield any data.

Although it is possible to store the _all field, it is basically an aggregation of all fields, which means more data will be stored, and highlighting it might produce strange results.

Length Norms: Recall the last bullet point in my description of the Lucene scoring algorithm above: Shorter fields contribute more to the score. If I have a document with a short title field and a longer body field, Lucene will automatically give title matches more weight. Since the _all field is just one big field, its length is the combined length of all the source fields, so a match is weighted the same whether it came from the title or the body.

Here’s a shell snippet that adds two such documents to an index:

# add a document to the index with "foobar" in the title
curl -X PUT 'http://localhost:9200/test/docs/1' -d '{
  "title": "foobar in title",
  "body": "text in the body, which is longer"
}'

# add a document to the index with "foobar" in the body
curl -X PUT 'http://localhost:9200/test/docs/2' -d '{
  "title": "text in title",
  "body": "foobar in the body, which is longer"
}'

And here’s a snippet to query the title and body fields for “foobar”.

# query title and body fields
curl -X POST 'http://localhost:9200/test/docs/_search?pretty' -d '{
  "query": {
    "query_string": {
      "query": "foobar",
      "fields": [
        "title",
        "body"
      ]
    }
  }
}'

As expected, the document with “foobar” in the title has a higher score.

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.058849156,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docs",
      "_id" : "1",
      "_score" : 0.058849156,
      "_source" : {
        "title": "foobar in title",
        "body": "text in the body, which is longer"
      }
    }, {
      "_index" : "test",
      "_type" : "docs",
      "_id" : "2",
      "_score" : 0.047079325,
      "_source" : {
        "title": "text in title",
        "body": "foobar in the body, which is longer"
      }
    } ]
  }
}

If instead the query is against the default _all field…

# query the _all field
curl -X POST 'http://localhost:9200/test/docs/_search?pretty' -d '{
  "query": {
    "query_string": {
      "query": "foobar"
    }
  }
}'

Both documents are assigned the same score.

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.11506981,
    "hits" : [ {
      "_index" : "test",
      "_type" : "docs",
      "_id" : "1",
      "_score" : 0.11506981,
      "_source" : {
        "title": "foobar in title",
        "body": "text in the body, which is longer"
      }
    }, {
      "_index" : "test",
      "_type" : "docs",
      "_id" : "2",
      "_score" : 0.11506981,
      "_source" : {
        "title": "text in title",
        "body": "foobar in the body, which is longer"
      }
    } ]
  }
}
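The identical scores are consistent with length normalization being computed over the combined _all text rather than per source field. The Python sketch below approximates this for the two example documents. It assumes the classic 1/sqrt(number of terms) length norm and a plain whitespace split as the tokenizer, which is not what a real elasticsearch analyzer does (stop words and punctuation are handled differently), so the numbers are illustrative only.

```python
import math


def length_norm(text):
    """1 / sqrt(number of terms), using a whitespace split as a stand-in tokenizer."""
    return 1.0 / math.sqrt(len(text.split()))


docs = {
    "1": {"title": "foobar in title", "body": "text in the body, which is longer"},
    "2": {"title": "text in title", "body": "foobar in the body, which is longer"},
}

all_norms = {}
for doc_id, doc in docs.items():
    # Stand-in for the _all field: the text of every field joined together.
    all_text = " ".join(doc.values())
    all_norms[doc_id] = length_norm(all_text)
    print(
        doc_id,
        "title:", round(length_norm(doc["title"]), 3),
        "body:", round(length_norm(doc["body"]), 3),
        "_all:", round(all_norms[doc_id], 3),
    )
```

Under these assumptions the title norm is larger than the body norm, which favors document 1 in the per-field query, while the _all norm is the same for both documents, so neither has an advantage.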

Conclusion

I have only begun to experiment with scoring. Until I am sure I want to override specific behaviors, I want to preserve as much of Lucene’s scoring algorithm as possible. This includes length norms. Also, I intend to highlight certain fields in my search results. As such, I have written my search queries to specify each field I want to search on instead of relying on the default _all field. To save space in the index, I’ve disabled the _all field entirely at index time.
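One way to disable the _all field is through the type mapping. The Python sketch below only builds and prints a mapping body. It assumes an elasticsearch version that still supports the _all field and mapping types, and it reuses the `docs` type name from the examples above; it is not a copy of the mapping I actually use.

```python
import json


def mapping_without_all(type_name):
    """Build an index-creation body whose mapping disables the _all field."""
    return {"mappings": {type_name: {"_all": {"enabled": False}}}}


# "docs" is the type name used in the curl examples earlier in this post.
body = mapping_without_all("docs")
print(json.dumps(body, indent=2))
```

The printed JSON could be sent as the body of the request that creates the index. With the _all field disabled, a query that does not name its fields no longer has that default field to fall back on, which is one more reason to list the fields explicitly.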
