Favouring exact matches in Elasticsearch

Published on Thursday, 25 October 2018 by Russ Cam

I recently came across a question on Stack Overflow asking about "Boosting elasticsearch results with NEST when a secondary field is a specific value". I thought the question was interesting enough to warrant a blog post, the first I've written in a while!

Essentially, the question is

  1. When searching for "Double Choc" ice cream description, expect the most relevant result to be a document with the exact description "Double Choc", followed by those including "Double Choc"
  2. When searching for "Ben & Jerries Double Choc" (sic), expect the most relevant result to be a document with the exact description "Ben & Jerries Double Choc", followed by those including "Double Choc"

Now, the documents in question look as follows:

public class IceCream
{
    public string Description { get; set; }
    public bool IsGeneric { get; set; }
    public double Price { get; set; }
}

The IsGeneric property can be used to mark an IceCream as a generic brand. The answer given on the question, which aligns with the approach that the question asker wants to take, is to slightly negatively boost documents that have IsGeneric set to false, in order to slightly favour generic ice cream brands. This will work OK, but I think there are more, and better, signals we can use to improve relevancy.
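
For reference, a negative boost along those lines might look something like the following boosting query (a sketch of the idea, not the answer's exact query); it demotes, but does not exclude, non-generic documents

{
  "boosting": {
    "positive": { "match": { "description": "Double Choc" } },
    "negative": { "term": { "isGeneric": false } },
    "negative_boost": 0.5
  }
}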

Signal modelling

There are a few different signals we may wish to incorporate into our search relevancy solution.

Disabling norms

A norm, short for normalization factor, is the inverse square root of the number of terms in a field, and contributes to the relevancy score calculation for that field. Concretely, a term appearing in a shorter field will apply a higher normalization factor to the relevancy score than a term appearing in a longer field, yielding a higher relevancy for a match in a short field than in a longer one. This can be useful when searching across multiple fields of disparate length, for example, the chapter title of a book and the content of the chapter, or a single field containing values of disparate length. Often however, particularly for queries on single fields, it is unwanted.
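
To make this concrete, here's a rough sketch using the classic TF/IDF length norm (Lucene also stores the value lossily in a single byte, and the BM25 similarity normalizes length a little differently, so treat the numbers as illustrative)

norm = 1 / √(number of terms in the field)

4-term field:  norm = 1 / √4  = 0.5
16-term field: norm = 1 / √16 = 0.25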

For this example, disabling norms on the description field is probably a good idea, since a match on the description field for the user query "Double Choc" should produce the same score for a document with the description "Cole's Double Choc Ice cream" as one with the description "Willy Wonka's Marvellous Neverending Double Choc Chip Ice Cream"; both descriptions contain the query "Double Choc" and we wouldn't want to favour the former over the latter purely because it has a shorter description.

Custom analyzers

My read on the question is that it's really about favouring exact matches to a query over partial matches on one or more terms within the query. Often when people talk about exact matches, they don't actually mean exact matches; some leniency in matching may be desired, such as

  1. lowercasing characters to make search case-insensitive
  2. removing accents and other diacritics by folding characters into their ASCII or Unicode equivalents, minus the diacritics, e.g. è, à and ù into e, a and u, respectively
  3. filtering characters to expand symbols into their common word counterparts e.g. & into and
  4. filtering characters that may often be skipped or misused e.g. apostrophes '
  5. catering for domain-specific common misspellings e.g. "Jerry's" in the question, misspelled as "Jerries"

This is where analysis and custom analyzers come in, and it is not uncommon to wish to analyze a given field in more than one way, to satisfy different search needs.
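
As a sketch of the first two points, a custom analyzer combining the built-in lowercase and asciifolding token filters would fold "Crème Brûlée" and "creme brulee" into the same terms (the index and analyzer names here are my own)

client.CreateIndex("desserts", c => c
    .Settings(s => s
        .Analysis(a => a
            .Analyzers(an => an
                .Custom("folding", ca => ca
                    .Tokenizer("standard")
                    .Filters("lowercase", "asciifolding")
                )
            )
        )
    )
);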

Combining queries

The match query is a good default search query to start with when looking to implement full-text search with Elasticsearch. For the question here however, a single match query is going to be an insufficient tool to home in on the two, potentially three, search signals in play. The first signal is an exact match to the user query. The second is a match for one or more terms in the user query. And the third potential signal is whether a given document represents a generic ice cream brand; if it does, give it a slight boost over a branded ice cream.

To incorporate these different signals into a single query, we'll need to construct a composite query containing three different queries, and then control how the score calculated for each contributes to the overall score for a given document. The bool query is our friend here.
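
Sketching ahead, the shape we're working towards looks something like the following, with required, score-contributing clauses under must, and optional, score-only clauses under should

{
  "bool": {
    "must": [
      // the exact and partial match queries; at least one must match
      { "bool": { "should": [ ... ] } }
    ],
    "should": [
      // optional boosting signals
    ]
  }
}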

There is another potential signal that we may wish to incorporate as well; when a set of matching documents have the same relevancy score, we may wish to add a small element of randomness to the order in which they appear within the results, to provide an element, or at least a perception, of freshness. If someone is paying us to serve up their particular ice cream brands, they might not want to be the brand that is always listed last amongst an equally scored set of documents! The function_score query's random_score function can help us with this.

An example with NEST

Let's put the previous pieces together into an example. For this, I'm going to use the high-level .NET Elasticsearch client, NEST, to make requests to Elasticsearch, and I'm going to be running them against Elasticsearch 5.6.6.

Creating the index, analyzers and mappings

First let's instantiate the client and create an index

var defaultIndex = "icecreams";

var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex(defaultIndex);

var client = new ElasticClient(settings);

// to allow the example to be re-run
if (client.IndexExists(defaultIndex).Exists)
    client.DeleteIndex(defaultIndex);

client.CreateIndex(defaultIndex, c => c
    .Settings(s => s
        .NumberOfShards(1)
        .NumberOfReplicas(0)
        .Analysis(a => a
            .Analyzers(an => an
                .Custom("exact_icecream", ca => ca
                    .CharFilters("convert_ampersand", "remove_apostrophes")
                    .Tokenizer("keyword")
                    .Filters("lowercase", "jerrys")
                )
                .Custom("standard_icecream", ca => ca			    
                    .CharFilters("remove_apostrophes")
                    .Tokenizer("standard")
                    .Filters("standard", "lowercase", "jerrys", "choc")
                )
            )
            .CharFilters(cf => cf
                .Mapping("convert_ampersand", mf => mf
                    .Mappings("& => and")
                )
                .Mapping("remove_apostrophes", mf => mf
                    .Mappings(
                        "\\u0091=>",
                        "\\u0092=>",
                        "\\u2018=>",
                        "\\u2019=>",
                        "\\u201B=>",
                        "\\u0027=>"
                    )
                )
            )
            .TokenFilters(tf => tf
                .PatternReplace("jerrys", pr => pr
                    .Pattern(@"(\b?)jerries(\b?)")
                    .Replacement(@"$1jerrys$2")
                )
                .Synonym("choc", sf => sf
                    .Synonyms(
                        "choc, chocolate"
                    )
                )
            )
        )
    )
    .Mappings(m => m
        .Map<IceCream>(mm => mm
            .AutoMap()
            .Properties(p => p
                .Text(t => t
                    .Name(n => n.Description)
                    .Norms(false)
                    .Analyzer("standard_icecream")
                    .Fields(f => f
                        .Text(tt => tt
                            .Name("exact")
                            .Analyzer("exact_icecream")
                            .Norms(false)
                        )
                    )
                )
            )
        )
    )
);

There are a few things going on in this request:

  1. The number of shards is set to 1 and replicas to 0. Relevancy scores are calculated *per shard*, meaning that an index composed of several shards distributed across multiple nodes within an Elasticsearch cluster can yield different relevancy scores for documents within each individual shard, when compared to the relevancy score that may be calculated when looking at the entire corpus of documents. In practice, with an even distribution of documents amongst shards, differences in relevancy scores across shards tend to diminish. For the purposes of this example, and likely for the 100,000 small documents stated in the question, a single shard will be enough. If search throughput and redundancy are needed, they can be achieved by adding replicas.
  2. A custom analyzer, exact_icecream, is created and will be used to find exact matches. We want to perform a little normalization on the input to
    1. convert & to the word and using a character filter
    2. remove apostrophes using a character filter
    3. tokenize the input as one term using the keyword tokenizer
    4. lowercase the term
    5. replace the incorrect spelling jerries with jerrys. The pattern_replace token filter uses optional word boundaries in the regular expression because "jerries" may appear within a larger token, as in the exact_icecream analyzer, or may be the entire token, as in the following standard_icecream analyzer
  3. A custom analyzer, standard_icecream, is created and will be used for analysis on the description field. This uses the standard tokenizer to tokenize the input into terms, then will
    1. lowercase terms
    2. replace incorrect spelling "jerries" with "jerrys"
    3. add synonyms for "choc" and "chocolate"
  4. Map the IceCream type using automapping, then override the inferred mapping for description to
    1. disable norms
    2. use the standard_icecream analyzer on the description field
    3. map the description field as a multi-field with a text sub-field, exact, that also disables norms and uses the exact_icecream analyzer

The index request sent ends up as

PUT http://localhost:9200/icecreams 
{
  "settings": {
    "index.number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "exact_icecream": {
          "type": "custom",
          "char_filter": [
            "convert_ampersand",
            "remove_apostrophes"
          ],
          "filter": [
            "lowercase",
            "jerrys"
          ],
          "tokenizer": "keyword"
        },
        "standard_icecream": {
          "type": "custom",
          "char_filter": [
            "remove_apostrophes"
          ],
          "filter": [
            "standard",
            "lowercase",
            "jerrys",
            "choc"
          ],
          "tokenizer": "standard"
        }
      },
      "char_filter": {
        "convert_ampersand": {
          "type": "mapping",
          "mappings": [
            "& => and"
          ]
        },
        "remove_apostrophes": {
          "type": "mapping",
          "mappings": [
            "\\u0091=>",
            "\\u0092=>",
            "\\u2018=>",
            "\\u2019=>",
            "\\u201B=>",
            "\\u0027=>"
          ]
        }
      },
      "filter": {
        "jerrys": {
          "type": "pattern_replace",
          "pattern": "(\\b?)jerries(\\b?)",
          "replacement": "$1jerrys$2"
        },
        "choc": {
          "type": "synonym",
          "synonyms": [
            "choc, chocolate"
          ]
        }
      }
    },
    "index.number_of_shards": 1
  },
  "mappings": {
    "icecream": {
      "properties": {
        "description": {
          "type": "text",
          "fields": {
            "exact": {
              "type": "text",
              "analyzer": "exact_icecream",
              "norms": false
            }
          },
          "analyzer": "standard_icecream",
          "norms": false
        },
        "isGeneric": {
          "type": "boolean"
        },
        "price": {
          "type": "double"
        }
      }
    }
  }
}

As an aside, if you wish to observe the requests and responses to Elasticsearch whilst developing, a good way is to output them somewhere such as standard output or trace, using the OnRequestCompleted() method on ConnectionSettings

var defaultIndex = "icecreams";

var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex(defaultIndex)
    .DisableDirectStreaming()
    .PrettyJson()
    .OnRequestCompleted(callDetails =>
    {
        if (callDetails.RequestBodyInBytes != null)
        {
            Console.WriteLine(
                $"{callDetails.HttpMethod} {callDetails.Uri} \n" +
                $"{Encoding.UTF8.GetString(callDetails.RequestBodyInBytes)}");
        }
        else
        {
            Console.WriteLine($"{callDetails.HttpMethod} {callDetails.Uri}");
        }

        Console.WriteLine();

        if (callDetails.ResponseBodyInBytes != null)
        {
            Console.WriteLine($"Status: {callDetails.HttpStatusCode}\n" +
                     $"{Encoding.UTF8.GetString(callDetails.ResponseBodyInBytes)}\n" +
                     $"{new string('-', 30)}\n");
        }
        else
        {
            Console.WriteLine($"Status: {callDetails.HttpStatusCode}\n" +
                     $"{new string('-', 30)}\n");
        }
    });

var client = new ElasticClient(settings);

Analyzing the analyzers

Now that the index is created, we can check what output our analyzers will yield for a given input using the Analyze API.

For the description field

var doubleChocQuery = "Double Choc";
var benAndJerrysQuery = "Ben & Jerries Double Choc";
    
client.Analyze(a => a
    .Index(defaultIndex)
    .Field<IceCream>(f => f.Description)
    .Text(doubleChocQuery)
);

client.Analyze(a => a
    .Index(defaultIndex)
    .Field<IceCream>(f => f.Description)
    .Text(benAndJerrysQuery)
);

we see the responses

{
  "tokens" : [
    {
      "token" : "double",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "choc",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "chocolate",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "SYNONYM",
      "position" : 1
    }
  ]
}

and

{
  "tokens" : [
    {
      "token" : "ben",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "jerrys",
      "start_offset" : 6,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "double",
      "start_offset" : 14,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "choc",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "chocolate",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "SYNONYM",
      "position" : 3
    }
  ]
}

The standard tokenizer has tokenized the text into terms according to the rules laid out in Unicode® Standard Annex #29. Notice also that the synonym token "chocolate" has been included in the same position as "choc" in both responses. Additionally, the token "jerrys" is included where "Jerries" appeared in the input.

The analysis for the description.exact field is quite different

client.Analyze(a => a
    .Index(defaultIndex)
    .Field<IceCream>(f => f.Description.Suffix("exact"))
    .Text(doubleChocQuery)
);

client.Analyze(a => a
    .Index(defaultIndex)
    .Field<IceCream>(f => f.Description.Suffix("exact"))
    .Text(benAndJerrysQuery)
);

yields

{
  "tokens" : [
    {
      "token" : "double choc",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "word",
      "position" : 0
    }
  ]
}

and

{
  "tokens" : [
    {
      "token" : "ben and jerrys double choc",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "word",
      "position" : 0
    }
  ]
}

The character filters have replaced the & with the word and, the keyword tokenizer has produced only one token in each case, and the token filters have lowercased that token and replaced "jerries" with "jerrys". This is looking like a reasonable output for exact matching. If we try a few different inputs for Ben & Jerry's Double Choc such as

var variations = new []
{
    "Ben & Jerry's Double Choc",
    "Ben and Jerry's Double Choc",
    "BEN AND JERRYS DOUBLE CHOC",
    "Ben and Jerries Double Choc",
    "Ben & Jerries Double Choc",
};

foreach(var variation in variations)
{
    client.Analyze(a => a
        .Index(defaultIndex)
        .Field<IceCream>(f => f.Description.Suffix("exact"))
        .Text(variation)
    );
}

we'll see the same token output for each.
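
Each response contains the single token ben and jerrys double choc; for the first variation, for example, the response looks like

{
  "tokens" : [
    {
      "token" : "ben and jerrys double choc",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "word",
      "position" : 0
    }
  ]
}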

Now let's bulk index some documents

var icecreams = new[] {
    new IceCream { Description = "Double Choc", IsGeneric = true, Price = 9.99 },
    new IceCream { Description = "Ben & Jerries Double Choc", IsGeneric = false, Price = 9.99 },
    new IceCream { Description = "Fairy Farms Double Choc", IsGeneric = false, Price = 9.99 },
    new IceCream { Description = "Dan's Double Chocolate", IsGeneric = false, Price = 9.99 },
};

client.Bulk(b => b.IndexMany(icecreams).Refresh(Refresh.WaitFor));

Now search with the input Double Choc using the following query

var random = new Random();

client.Search<IceCream>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Query(fsq => fsq    
                .Bool(b => b
                    .Must(m => m
                        .Match(mp => mp
                            .Field(f => f.Description.Suffix("exact"))
                            .Query(doubleChocQuery)
                            .Boost(5)
                        ) || m
                        .Match(ma => ma
                            .Field(f => f.Description)
                            .Query(doubleChocQuery)
                        )
                    )
                )
            )
            .Functions(fu => fu
                .RandomScore(rs => rs
                    .Seed(random.Next())
                )
            )
            .ScoreMode(FunctionScoreMode.Sum)
        )
    )
);

which produces the search query

POST http://localhost:9200/icecreams/icecream/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "match": {
                      "description.exact": {
                        "boost": 5.0,
                        "query": "Double Choc"
                      }
                    }
                  },
                  {
                    "match": {
                      "description": {
                        "query": "Double Choc"
                      }
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "functions": [
        {
          "random_score": {
            "seed": 1308942479
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}

and the results

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 4.9569297,
    "hits" : [
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3A",
        "_score" : 4.9569297,
        "_source" : {
          "description" : "Double Choc",
          "isGeneric" : true,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3B",
        "_score" : 0.24639045,
        "_source" : {
          "description" : "Ben & Jerries Double Choc",
          "isGeneric" : false,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3C",
        "_score" : 0.16460618,
        "_source" : {
          "description" : "Fairy Farms Double Choc",
          "isGeneric" : false,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3D",
        "_score" : 0.049376354,
        "_source" : {
          "description" : "Dan's Double Chocolate",
          "isGeneric" : false,
          "price" : 9.99
        }
      }
    ]
  }
}

Before discussing the results, let's also look for matches for Ben & Jerries Double Choc

client.Search<IceCream>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Query(fsq => fsq    
                .Bool(b => b
                    .Must(m => m
                        .Match(mp => mp
                            .Field(f => f.Description.Suffix("exact"))
                            .Query(benAndJerrysQuery)
                            .Boost(5)
                        ) || m
                        .Match(ma => ma
                            .Field(f => f.Description)
                            .Query(benAndJerrysQuery)
                        )
                    )
                )
            )
            .Functions(fu => fu
                .RandomScore(rs => rs
                    .Seed(random.Next())
                )
            )
            .ScoreMode(FunctionScoreMode.Sum)
        )
    )
);

which yields the following request and response

POST http://localhost:9200/icecreams/icecream/_search
{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "bool": {
                "should": [
                  {
                    "match": {
                      "description.exact": {
                        "boost": 5.0,
                        "query": "Ben & Jerries Double Choc"
                      }
                    }
                  },
                  {
                    "match": {
                      "description": {
                        "query": "Ben & Jerries Double Choc"
                      }
                    }
                  }
                ]
              }
            }
          ]
        }
      },
      "functions": [
        {
          "random_score": {
            "seed": 874175983
          }
        }
      ],
      "score_mode": "sum"
    }
  }
}

// Response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 0.8141881,
    "hits" : [
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3B",
        "_score" : 0.8141881,
        "_source" : {
          "description" : "Ben & Jerries Double Choc",
          "isGeneric" : false,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3D",
        "_score" : 0.22804688,
        "_source" : {
          "description" : "Dan's Double Chocolate",
          "isGeneric" : false,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3A",
        "_score" : 0.18919782,
        "_source" : {
          "description" : "Double Choc",
          "isGeneric" : true,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3C",
        "_score" : 0.10123283,
        "_source" : {
          "description" : "Fairy Farms Double Choc",
          "isGeneric" : false,
          "price" : 9.99
        }
      }
    ]
  }
}

This is a fairly involved query, so let's break it down:

The outer function_score query allows us to run a query and then run one or more functions over the resulting documents to compute a new relevancy score. The function_score query can be great for using features of the documents themselves to influence the relevancy score. In this example, only a single random_score function is applied, computing a random number between 0 and 1 for each document. The function is seeded with a random integer, but in a real system the seed could be the ID of the logged-in user, or similar. A small amount of randomness means that documents with the same score will not always appear in the same order in the results. In the two search queries above, the documents in second, third and fourth position in each response are in a different order and have slightly different scores compared to each other. If the random_score function were taken out, you'd see that the scores for second, third and fourth position in each response are the same.
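
As a sketch of that per-user seeding, replacing the random seed with a stable, hypothetical userId would keep the ordering consistent for a given user whilst varying between users (the rest of the query is elided to a match_all for brevity)

var userId = 42; // hypothetical: a stable identifier for the logged-in user

client.Search<IceCream>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Query(fsq => fsq.MatchAll())
            .Functions(fu => fu
                .RandomScore(rs => rs
                    .Seed(userId) // same user, same ordering between requests
                )
            )
            .ScoreMode(FunctionScoreMode.Sum)
        )
    )
);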

The query part of the function_score query is a bool query composed of a single must clause: a bool query with two should clauses. The two clauses are a match query on the description.exact field with a large boost of 5, and a match query on the description field. For a document to be considered a hit for this query, it needs to satisfy at least one of the match queries, and the large boost on the match query on the description.exact field means that a match here will contribute significantly to the relevancy score computed for the document. You might be wondering why a bool query with two should clauses is nested inside a bool query must clause. You'd be right to wonder; it's not actually necessary in this example, but I'll explain why it's written this way shortly.

For both sets of search results, the document with an exact match on description for the input is the top result, followed by the three documents that partially match, in a slightly random order. You can see that the score for the top result is considerably larger than the scores for documents in second position onwards, due to the boost applied to the match query on the description.exact field that's using the exact_icecream analyzer.

Referring back to the search signals, the original question asker and answerer were looking at favouring generic brands by slightly boosting them over branded ice creams or, inversely, slightly negatively boosting non-generic brands. Coming back to how the search query has been constructed, we can add this signal in quite easily, by adding a term query as a should clause on the outermost bool query

client.Search<IceCream>(s => s
    .Query(q => q
        .FunctionScore(fs => fs
            .Query(fsq => fsq    
                .Bool(b => b
                    .Must(m => m
                        .Match(mp => mp
                            .Field(f => f.Description.Suffix("exact"))
                            .Query(benAndJerrysQuery)
                            .Boost(5)
                        ) || m
                        .Match(ma => ma
                            .Field(f => f.Description)
                            .Query(benAndJerrysQuery)
                        )
                    )
                    .Should(ss => ss
                        .Term(t => t.IsGeneric, true)
                    )
                )
            )
            .Functions(fu => fu
                .RandomScore(rs => rs
                    .Seed(random.Next())
                )
            )
            .ScoreMode(FunctionScoreMode.Sum)
        )
    )
);

This yields the results

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 5.970637,
    "hits" : [
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3B",
        "_score" : 5.970637,
        "_source" : {
          "description" : "Ben & Jerries Double Choc",
          "isGeneric" : false,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3A",
        "_score" : 1.1512128,
        "_source" : {
          "description" : "Double Choc",
          "isGeneric" : true,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3C",
        "_score" : 0.24227849,
        "_source" : {
          "description" : "Fairy Farms Double Choc",
          "isGeneric" : false,
          "price" : 9.99
        }
      },
      {
        "_index" : "icecreams",
        "_type" : "icecream",
        "_id" : "AWESBtI7-3-iuPJ3Qs3D",
        "_score" : 0.0754106,
        "_source" : {
          "description" : "Dan's Double Chocolate",
          "isGeneric" : false,
          "price" : 9.99
        }
      }
    ]
  }
}

As expected, the exact match document appears at the top, followed by the generic brand in position two. Here, the should clause with the term query acts as a boosting signal for the relevancy score. That is, a document need only match the must clause, but if it also matches the should clause, it will receive a slightly higher relevancy score. I don't know if this term query provides much value compared to the match queries that we have, but it's interesting to see how it affects the results. If we did intend to use it, we might want to tone down the influence of the random_score function on the overall score, to allow the term query's influence to come through more prominently, as sketched below.
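
One way to do that is to give the random_score function a weight, a standard function_score option that scales a function's output before it is combined with the query score; something like

"functions": [
  {
    "random_score": {
      "seed": 874175983
    },
    "weight": 0.1
  }
]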

I hope this has been useful in exploring some of the features available within Elasticsearch to satisfy your search needs :) I've added all the code as a gist for you to play with.

