Elasticsearch allows indexing multiple values into a field, and to query on that field to find documents with matching values. We'll use the following documents indexed into Elasticsearch as an example
POST /my-data/_bulk?refresh=wait_for
{"index":{}}
{"values":[1.02, 1.03, 1.04, 0.0]}
{"index":{}}
{"values":[0.0, 1.02, 1.03]}
{"index":{}}
{"values":[0.0, -1.0]}
Terms query
Running a terms
query, we can retrieve documents that have the values 0.0 or 1.02 or 1.03 in the "values" field
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"terms": {
"values": [0.0, 1.02, 1.03]
}
}
}
This returns all three documents
{
"hits" : {
"hits" : [
{
"_source" : {
"values" : [
1.02,
1.03,
1.04,
0.0
]
}
},
{
"_source" : {
"values" : [
0.0,
1.02,
1.03
]
}
},
{
"_source" : {
"values" : [
0.0,
-1.0
]
}
}
]
}
}
Combined term queries
If we want to get just those documents that have the values 0.0 and 1.02 and 1.03 in the "values"
field, we can combine three term
queries in a bool query
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"bool": {
"filter": [
{
"term": {
"values": 0.0
}
},
{
"term": {
"values": 1.02
}
},
{
"term": {
"values": 1.03
}
}
]
}
}
}
which yields
{
"hits" : {
"hits" : [
{
"_source" : {
"values" : [
1.02,
1.03,
1.04,
0.0
]
}
},
{
"_source" : {
"values" : [
0.0,
1.02,
1.03
]
}
}
]
}
}
Terms set query
If we want to get just those documents that have the values 0.0 and 1.02 and 1.03 in the "values"
field, and only these values and no others, we can use the terms_set
query
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"terms_set": {
"values": {
"terms": [0.0, 1.02, 1.03],
"minimum_should_match_script": {
"source": "Math.max(params.num_terms, doc['values'].size())"
}
}
}
}
}
which yields
{
"hits" : {
"hits" : [
{
"_source" : {
"values" : [
0.0,
1.02,
1.03
]
}
}
]
}
}
Using the maximum value of either the count of values in the individual document or the number of terms passed in the query as the minimum number of terms that must match, limits the documents only to those with a count of values and values that match those passed
Querying individual elements in an array field
What if we wanted to query for documents that have a value of 0.0 at the last index in the "values"
array field?
Script query
We might try to tackle this by using a script
query and checking that the last value in doc values is 0
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"script": {
"script": {
"source": "doc['values'][doc['values'].size() - 1] == 0"
}
}
}
}
But this returns
{
"hits" : {
"hits" : [
{
"_source" : {
"values" : [
0.0,
-1.0
]
}
}
]
}
}
When we would have expected it to return a single document with the values [1.02, 1.03, 1.04, 0.0]
. Using Painless scripting's Debug.explain
may provide more details
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"script": {
"script": {
"source": "Debug.explain(doc['values'])"
}
}
}
}
which yields (removing some details for brevity)
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"painless_class" : "org.elasticsearch.index.fielddata.ScriptDocValues.Doubles",
"to_string" : "[0.0, 1.0199999809265137, 1.0299999713897705, 1.0399999618530273]",
"java_class" : "org.elasticsearch.index.fielddata.ScriptDocValues$Doubles",
"script_stack" : [
"Debug.explain(doc['values'])",
" ^---- HERE"
],
"script" : "Debug.explain(doc['values'])",
"lang" : "painless",
"position" : {
"offset" : 17,
"start" : 0,
"end" : 28
}
}
]
},
"status" : 400
}
The two most important lines are the painless_class
and to_string
, which tell us the type of doc['values']
and the values it holds, respectively. We're dealing with ScriptDocValues.Doubles
, a collection type for double
values that come from the Doc values column-oriented data structure, and the results show that the values are retrieved from doc values in ascending order. In other words, any information related to the original positions of values in the "values"
field of the indexed document are lost, which is why the document with values [0.0, -1.0]
was returned in the results, even though -1.0 was the last value in the field.
Script query with _source
Doc values does not preserve positions, but the original _source document does, so we can try querying this
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"script": {
"script": {
"source": "params._source['values'][params._source['values'].size() - 1] == 0"
}
}
}
}
This however returns a null pointer exception
{
"error" : {
"root_cause" : [
{
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"params._source['values'][params._source['values'].size() - 1] == 0",
" ^---- HERE"
],
"script" : "params._source['values'][params._source['values'].size() - 1] == 0",
"lang" : "painless",
"position" : {
"offset" : 6,
"start" : 0,
"end" : 66
}
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "my-data",
"node" : "AEw1g-tvQj6PKnBZC2Boag",
"reason" : {
"type" : "script_exception",
"reason" : "runtime error",
"script_stack" : [
"params._source['values'][params._source['values'].size() - 1] == 0",
" ^---- HERE"
],
"script" : "params._source['values'][params._source['values'].size() - 1] == 0",
"lang" : "painless",
"position" : {
"offset" : 6,
"start" : 0,
"end" : 66
},
"caused_by" : {
"type" : "null_pointer_exception",
"reason" : "cannot access method/field [normalizeIndex] from a null def reference"
}
}
}
]
},
"status" : 400
}
A null pointer exception occurs because the _source document is not available to a script query
Aside
It was possible to access _source in the script query context in earlier versions of Elasticsearch, but this was an error and a side effect of exposing _source to script fields. This has now been corrected in newer versions of Elasticsearch.
Is there another way?
Function score query script score function
The Painless contexts documentation lists the values that are available in each script context. The function score query's script score function exposes the _source document, allowing us to query for documents that have a value of 0.0 at the last index in the "values"
array field
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"function_score": {
"functions": [
{
"script_score": {
"script": {
"source": "params._source.values[params._source.values.size() - 1] == 0 ? 1 : 0"
}
}
}
],
"min_score": 1
}
}
}
which yields the expected document
{
"hits" : {
"hits" : [
{
"_source" : {
"values" : [
1.02,
1.03,
1.04,
0.0
]
}
}
]
}
So, why does the script
query not expose _source as an available parameter in the script context? The primary reason is that it's not a good idea to access _source because it's relatively very slow and is not a feasible solution at scale. Whilst using the script score function works for now, it's not the best way to tackle this problem; let's look at another solution.
Ingest pipeline
If we only ever need to query the last item in an array, we could use an ingest pipeline to copy the last value to another field that we can then query. To set up a pipeline
PUT /_ingest/pipeline/extract_last_value
{
"processors": [
{
"script": {
"source": "if (ctx.containsKey('values')) { ctx['last_value'] = ctx['values'][ctx['values'].size() - 1]; }"
}
}
]
}
Then, update all indexed documents by using update by query API along with the defined pipeline
POST my-data/_update_by_query?conflicts=proceed&pipeline=extract_last_value
Now, we can query using the "last_value"
field
POST /my-data/_search?pretty&filter_path=hits.hits._source
{
"query": {
"term": {
"last_value": 0.0
}
}
}
which returns the expected document
{
"hits" : {
"hits" : [
{
"_source" : {
"last_value" : 0.0,
"values" : [
1.02,
1.03,
1.04,
0.0
]
}
}
]
}
}
This approach will be faster than scripting.