Forloop - Querying values in an array field in Elasticsearch

Elasticsearch allows indexing multiple values into a field, and to query on that field to find documents with matching values. We'll use the following documents indexed into Elasticsearch as an example

POST /my-data/_bulk?refresh=wait_for
{"index":{}}
{"values":[1.02, 1.03, 1.04, 0.0]}
{"index":{}}
{"values":[0.0, 1.02, 1.03]}
{"index":{}}
{"values":[0.0, -1.0]}

Terms query

Running a terms query, we can retrieve documents that have the values 0.0 or 1.02 or 1.03 in the "values" field

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "terms": {
      "values": [0.0, 1.02, 1.03]
    }
  }
}

This returns all three documents

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "values" : [
            1.02,
            1.03,
            1.04,
            0.0
          ]
        }
      },
      {
        "_source" : {
          "values" : [
            0.0,
            1.02,
            1.03
          ]
        }
      },
      {
        "_source" : {
          "values" : [
            0.0,
            -1.0
          ]
        }
      }
    ]
  }
}

Combined term queries

If we want to get just those documents that have the values 0.0 and 1.02 and 1.03 in the "values" field, we can combine three term queries in a bool query

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "values": 0.0
          }
        },
        {
          "term": {
            "values": 1.02
          }
        },
        {
          "term": {
            "values": 1.03
          }
        }
      ]
    }
  }
}

which yields

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "values" : [
            1.02,
            1.03,
            1.04,
            0.0
          ]
        }
      },
      {
        "_source" : {
          "values" : [
            0.0,
            1.02,
            1.03
          ]
        }
      }
    ]
  }
}

Terms set query

If we want to get just those documents that have the values 0.0 and 1.02 and 1.03 in the "values" field, and only these values and no others, we can use the terms_set query

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "terms_set": {
      "values": {
        "terms": [0.0, 1.02, 1.03],
        "minimum_should_match_script": {
          "source": "Math.max(params.num_terms, doc['values'].size())"
        }
      }
    }
  }
}

which yields

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "values" : [
            0.0,
            1.02,
            1.03
          ]
        }
      }
    ]
  }
}

Using the maximum value of either the count of values in the individual document or the number of terms passed in the query as the minimum number of terms that must match, limits the documents only to those with a count of values and values that match those passed

Querying individual elements in an array field

What if we wanted to query for documents that have a value of 0.0 at the last index in the "values" array field?

Script query

We might try to tackle this by using a script query and checking that the last value in doc values is 0

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "script": {
      "script": {
        "source": "doc['values'][doc['values'].size() - 1] == 0"
      }
    }
  }
}

But this returns

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "values" : [
            0.0,
            -1.0
          ]
        }
      }
    ]
  }
}

When we would have expected it to return a single document with the values [1.02, 1.03, 1.04, 0.0]. Using Painless scripting's Debug.explain may provide more details

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "script": {
      "script": {
        "source": "Debug.explain(doc['values'])"
      }
    }
  }
}

which yields (removing some details for brevity)

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "painless_class" : "org.elasticsearch.index.fielddata.ScriptDocValues.Doubles",
        "to_string" : "[0.0, 1.0199999809265137, 1.0299999713897705, 1.0399999618530273]",
        "java_class" : "org.elasticsearch.index.fielddata.ScriptDocValues$Doubles",
        "script_stack" : [
          "Debug.explain(doc['values'])",
          "                 ^---- HERE"
        ],
        "script" : "Debug.explain(doc['values'])",
        "lang" : "painless",
        "position" : {
          "offset" : 17,
          "start" : 0,
          "end" : 28
        }
      }
    ]
  },
  "status" : 400
}

The two most important lines are the painless_class and to_string, which tell us the type of doc['values'] and the values it holds, respectively. We're dealing with ScriptDocValues.Doubles, a collection type for double values that come from the Doc values column-oriented data structure, and the results show that the values are retrieved from doc values in ascending order. In other words, any information related to the original positions of values in the "values" field of the indexed document are lost, which is why the document with values [0.0, -1.0] was returned in the results, even though -1.0 was the last value in the field.

Script query with _source

Doc values does not preserve positions, but the original _source document does, so we can try querying this

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "script": {
      "script": {
        "source": "params._source['values'][params._source['values'].size() - 1] == 0"
      }
    }
  }
}

This however returns a null pointer exception

{
  "error" : {
    "root_cause" : [
      {
        "type" : "script_exception",
        "reason" : "runtime error",
        "script_stack" : [
          "params._source['values'][params._source['values'].size() - 1] == 0",
          "      ^---- HERE"
        ],
        "script" : "params._source['values'][params._source['values'].size() - 1] == 0",
        "lang" : "painless",
        "position" : {
          "offset" : 6,
          "start" : 0,
          "end" : 66
        }
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "my-data",
        "node" : "AEw1g-tvQj6PKnBZC2Boag",
        "reason" : {
          "type" : "script_exception",
          "reason" : "runtime error",
          "script_stack" : [
            "params._source['values'][params._source['values'].size() - 1] == 0",
            "      ^---- HERE"
          ],
          "script" : "params._source['values'][params._source['values'].size() - 1] == 0",
          "lang" : "painless",
          "position" : {
            "offset" : 6,
            "start" : 0,
            "end" : 66
          },
          "caused_by" : {
            "type" : "null_pointer_exception",
            "reason" : "cannot access method/field [normalizeIndex] from a null def reference"
          }
        }
      }
    ]
  },
  "status" : 400
}

A null pointer exception occurs because the _source document is not available to a script query

Aside

It was possible to access _source in the script query context in earlier versions of Elasticsearch, but this was an error and a side effect of exposing _source to script fields. This has now been corrected in newer versions of Elasticsearch.

Is there another way?

Function score query script score function

The Painless contexts documentation lists the values that are available in each script context. The function score query's script score function exposes the _source document, allowing us to query for documents that have a value of 0.0 at the last index in the "values" array field

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "function_score": {
      "functions": [
        {
          "script_score": {
            "script": {
              "source": "params._source.values[params._source.values.size() - 1] == 0 ? 1 : 0"
            }
          }
        }
      ],
      "min_score": 1
    }
  }
}

which yields the expected document

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "values" : [
            1.02,
            1.03,
            1.04,
            0.0
          ]
        }
      }
    ]
  }

So, why does the script query not expose _source as an available parameter in the script context? The primary reason is that it's not a good idea to access _source because it's relatively very slow and is not a feasible solution at scale. Whilst using the script score function works for now, it's not the best way to tackle this problem; let's look at another solution.

Ingest pipeline

If we only ever need to query the last item in an array, we could use an ingest pipeline to copy the last value to another field that we can then query. To set up a pipeline

PUT /_ingest/pipeline/extract_last_value
{
  "processors": [
    {
      "script": {
       "source": "if (ctx.containsKey('values')) { ctx['last_value'] = ctx['values'][ctx['values'].size() - 1]; }"
      }
    }
  ]
}

Then, update all indexed documents by using update by query API along with the defined pipeline

POST my-data/_update_by_query?conflicts=proceed&pipeline=extract_last_value

Now, we can query using the "last_value" field

POST /my-data/_search?pretty&filter_path=hits.hits._source
{
  "query": {
    "term": {
      "last_value": 0.0
    }
  }
}

which returns the expected document

{
  "hits" : {
    "hits" : [
      {
        "_source" : {
          "last_value" : 0.0,
          "values" : [
            1.02,
            1.03,
            1.04,
            0.0
          ]
        }
      }
    ]
  }
}

This approach will be faster than scripting.