ElasticSearch Query DSL

From PaskvilWiki

Latest revision as of 13:17, 23 January 2013

ES's Query DSL is a language for specifying queries in JSON.

This is by no means exhaustive documentation; it only covers what I use the most. See the official documentation (http://www.elasticsearch.org/guide/reference/query-dsl/) for more. In particular, the boosting and scoring functionality is only touched on here.

Queries

match, multi_match

The match queries accept text, numeric, or date values, analyze them, and construct a query from the result. The match family of queries does not go through a "query parsing" process: it does not support field name prefixes, wildcard characters, or other "advanced" features.

Here, message is the name of the field to match in (it can also be _all):

{
    "match" : {
        "message" : "this is a test"
    }
}

By default, terms are OR'ed; to AND them:

{
    "match" : {
        "message" : {
            "query" : "this is a test",
            "operator" : "and"
        }
    }
}

To match a phrase:

{
    "match_phrase" : {
        "message" : "this is a test"
    }
}

or using the last word as prefix (the "as you type" search):

{
    "match_phrase_prefix" : {
        "message" : "this is a test"
    }
}

To match in multiple fields, with optional boosting, use:

{
  "multi_match" : {
    "query" : "this is a test",
    "fields" : [ "subject^2", "message" ]
  }
}

where matches in subject are "twice as important" as matches in message.
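
When these queries are assembled in client code, the match variants above differ only in the wrapper key and the optional operator. A minimal Python sketch, assuming the query body is built as a plain dict and serialized with json; the helper names match_query and multi_match_query are made up for illustration and do not belong to any ElasticSearch client:

```python
import json

MATCH_KINDS = ("match", "match_phrase", "match_phrase_prefix")


def match_query(field, text, operator=None, kind="match"):
    """Build a match-family query body as a plain dict (hypothetical helper)."""
    if kind not in MATCH_KINDS:
        raise ValueError("unknown match kind: %s" % kind)
    if operator is None:
        # Short form: { "match" : { "message" : "this is a test" } }
        return {kind: {field: text}}
    # Long form with an explicit operator, e.g. "and".
    return {kind: {field: {"query": text, "operator": operator}}}


def multi_match_query(text, fields, boosts=None):
    """Build a multi_match query; boosts maps field name -> boost factor."""
    boosts = boosts or {}
    boosted = ["%s^%s" % (f, boosts[f]) if f in boosts else f for f in fields]
    return {"multi_match": {"query": text, "fields": boosted}}


print(json.dumps(match_query("message", "this is a test", operator="and"), indent=4))
print(json.dumps(multi_match_query("this is a test", ["subject", "message"], {"subject": 2})))
```

The two calls at the end reproduce the AND'ed match and the boosted multi_match bodies shown above.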

bool

The bool query provides a Boolean combination of queries with typed occurrence:

  • must - the clause must appear in matching documents,
  • should - the clause should appear; if no must clause is provided, at least one should clause must match; you can also specify the minimum_number_should_match parameter,
  • must_not - the clause must not appear in matching documents.
{
    "bool" : {
        "must" : {
            "term" : { "user" : "kimchy" }
        },
        "must_not" : {
            "range" : {
                "age" : { "from" : 10, "to" : 20 }
            }
        },
        "should" : [
            {
                "term" : { "tag" : "wow" }
            },
            {
                "term" : { "tag" : "elasticsearch" }
            }
        ],
        "minimum_number_should_match" : 1,
        "boost" : 1.0
    }
}
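
If the clauses come from application code, it helps to build the bool body from lists and to leave out empty occurrence types. A rough Python sketch under that assumption; bool_query is a hypothetical helper name, and each clause is passed as a list (a single clause or an array both work for must, should, and must_not):

```python
def bool_query(must=None, should=None, must_not=None, minimum_should_match=None):
    """Combine already-built query dicts into a bool query (hypothetical helper).

    Occurrence types with no clauses are left out of the body.
    """
    body = {}
    if must:
        body["must"] = list(must)
    if must_not:
        body["must_not"] = list(must_not)
    if should:
        body["should"] = list(should)
        if minimum_should_match is not None:
            # Parameter name as spelled in the example above.
            body["minimum_number_should_match"] = minimum_should_match
    return {"bool": body}


query = bool_query(
    must=[{"term": {"user": "kimchy"}}],
    must_not=[{"range": {"age": {"from": 10, "to": 20}}}],
    should=[{"term": {"tag": "wow"}}, {"term": {"tag": "elasticsearch"}}],
    minimum_should_match=1,
)
```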

boosting

Boosting can be used to promote or demote search results:

{
    "boosting" : {
        "positive" : {
            "term" : {
                "field1" : "value1"
            }
        },
        "negative" : {
            "term" : {
                "field2" : "value2"
            }
        },
        "negative_boost" : 0.2
    }
}

ids

Match by ID:

{
    "ids" : {
        "type" : "my_type",
        "values" : ["1", "4", "100"]
    }
}

Note: the type field is optional, and may contain an array of values.

field

Query only on a specified field (equivalent of query_string with default_field):

{
    "field" : { 
        "name.first" : "+something -else"
    }
}

filtered

Filters the results of a query. Filtering may be much faster than querying, as no scoring is done and filter results may be cached:

{
    "filtered" : {
        "query" : {
            "term" : { "tag" : "wow" }
        },
        "filter" : {
            "range" : {
                "age" : { "from" : 10, "to" : 20 }
            }
        }
    }
}
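
For completeness, a small Python sketch of how such a filtered query might be wrapped into a search request body. The top-level "query" key is the usual wrapper of a search body and is not shown elsewhere on this page; filtered_query is a hypothetical helper name:

```python
import json


def filtered_query(query, filter_):
    """Wrap a scoring query and a non-scoring filter into a filtered query."""
    return {"filtered": {"query": query, "filter": filter_}}


# Assumption: the body is sent to a search endpoint under a top-level "query" key.
search_body = {
    "query": filtered_query(
        {"term": {"tag": "wow"}},
        {"range": {"age": {"from": 10, "to": 20}}},
    )
}
serialized = json.dumps(search_body, indent=4)
print(serialized)
```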

query_string

Uses a query parser to parse its content.

{
    "query_string" : {
        "default_field" : "content",
        "query" : "this AND that OR thus"
    }
}

Parameters

  • query - the actual query to be parsed,
  • default_field - default field for query terms (if no field prefix is specified); defaults to the index.query.default_field setting, which in turn defaults to _all,
  • fields - run the query against multiple fields (provided as an array):
    • "fields" : ["content", "name"],
    • optionally with boosting: "fields" : ["content", "name^5"],
    • wildcards may be used for fields: "fields" : ["city.*"] if the document contains an object city,
    • to check for existence or nonexistence of fields, use: _exists_:field1 and _missing_:field,
  • default_operator - default operator used (if none is explicitly specified); e.g. with default operator OR, the query "capital of Hungary" is translated to "capital OR of OR Hungary"; default is OR,
  • allow_leading_wildcard - are * or ? allowed as the first character? default true,
  • lowercase_expanded_terms - should terms of wildcard, prefix, fuzzy, and range queries be automatically lower-cased? (since they are not analyzed); default true,
  • boost - boost value of the query; default 1.0,
  • minimum_should_match - percent value ("20%") controlling how many "should" clauses in the resulting boolean query should match,
  • lenient - if true, format-based failures (like providing text to a numeric field) are ignored.
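
The parameters above can be collected into a small builder that rejects misspelled option names before the request is sent. This is only a Python sketch: query_string here is a hypothetical helper, and it knows just the options listed on this page, not everything ElasticSearch accepts:

```python
# Only the options documented in the list above.
KNOWN_OPTIONS = {
    "default_operator",
    "allow_leading_wildcard",
    "lowercase_expanded_terms",
    "boost",
    "minimum_should_match",
    "lenient",
}


def query_string(query, default_field=None, fields=None, **options):
    """Build a query_string query body from the parameters listed above."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError("unsupported parameters: %s" % ", ".join(sorted(unknown)))
    body = {"query": query}
    if default_field is not None:
        body["default_field"] = default_field
    if fields is not None:
        body["fields"] = list(fields)
    body.update(options)
    return {"query_string": body}


q = query_string(
    "this AND that OR thus",
    fields=["content", "name^5"],
    default_operator="AND",
    lenient=True,
)
```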

range

Matches documents whose field value falls within the given range. For string fields, the TermRangeQuery is used, while for number/date fields, the query is a NumericRangeQuery.

{
    "range" : {
        "age" : { 
            "from" : 10, 
            "to" : 20, 
            "include_lower" : true, 
            "include_upper": false, 
            "boost" : 2.0
        }
    }
}

You can also use the following abbreviations:

  • gt = from + include_lower=false,
  • gte = from + include_lower=true,
  • lt = to + include_upper=false,
  • lte = to + include_upper=true.
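
The abbreviations can be read as a simple rewrite rule. The Python sketch below expands them into the long form; expand_range is a made-up name, and the function only illustrates the mapping in the list above, it is not how ElasticSearch itself processes the query:

```python
# short key -> (long key, inclusiveness flag, flag value)
ABBREVIATIONS = {
    "gt": ("from", "include_lower", False),
    "gte": ("from", "include_lower", True),
    "lt": ("to", "include_upper", False),
    "lte": ("to", "include_upper", True),
}


def expand_range(bounds):
    """Rewrite gt/gte/lt/lte into from/to plus include_lower/include_upper."""
    expanded = {}
    for key, value in bounds.items():
        if key in ABBREVIATIONS:
            bound, flag, inclusive = ABBREVIATIONS[key]
            expanded[bound] = value
            expanded[flag] = inclusive
        else:
            # Anything else (e.g. "boost") is copied unchanged.
            expanded[key] = value
    return expanded


long_form = {"range": {"age": expand_range({"gte": 10, "lt": 20, "boost": 2.0})}}
```

Here long_form is the same body as the range example above.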

term, terms

Matches documents that have fields that contain a term (not analyzed).

{ "term" : { "user" : "kimchy" } }
{ "term" : { "user" : { "term" : "kimchy", "boost" : 2.0 } } }

Filters

Filters are great candidates for caching. Caching the result of a filter does not require a lot of memory, and makes other queries that use the same filter (with the same parameters) very fast. In particular, the term, terms, prefix, and range filters are cached by default, and are recommended over the equivalent queries.

and, or, not

Match documents by combining other filters with the AND, OR, or NOT operator; more performant than the bool filter.

These filters are not cached by default.

{
    "filtered" : {
        "query" : {
            "term" : { "name.first" : "shay" }
        },
        "filter" : {
            "and" : [
                {
                    "range" : { 
                        "postDate" : { 
                            "from" : "2010-03-01",
                            "to" : "2010-04-01"
                        }
                    }
                },
                {
                    "prefix" : { "name.second" : "ba" }
                }
            ]
        }
    }
}

To cache the results of the filter:

{
    "filtered" : {
        "query" : {
            "term" : { "name.first" : "shay" }
        },
        "filter" : {
            "or" : {
                "filters" : [
                    {
                        "term" : { "name.second" : "banon" }
                    },
                    {
                        "term" : { "name.nick" : "kimchy" }
                    }
                ],
                "_cache" : true
            }
        }
    }
}
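
Note that the two examples use different shapes: a bare array when the result is not cached, and an object with filters and _cache when it is. A Python sketch that picks the shape for you, assuming only the two forms shown above (combine_filters is a hypothetical name, and the not filter is left out because its form is not shown here):

```python
def combine_filters(operator, filters, cache=False):
    """Build an "and" / "or" filter in the short or the cached form."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    if not cache:
        # Short form: the filters are given directly as an array.
        return {operator: list(filters)}
    # Cached form: the filters move under "filters", next to "_cache".
    return {operator: {"filters": list(filters), "_cache": True}}


cached_or = combine_filters(
    "or",
    [{"term": {"name.second": "banon"}}, {"term": {"name.nick": "kimchy"}}],
    cache=True,
)
```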

bool

Matches documents that satisfy Boolean combinations of other filters. It is similar to the bool query, except that the clauses are filters:

{
    "filtered" : {
        "query" : {
            "query_string" : {
                "default_field" : "message", 
                "query" : "elasticsearch"
            }
        },
        "filter" : {
            "bool" : {
                "must" : {
                    "term" : { "tag" : "wow" }
                },
                "must_not" : {
                    "range" : {
                        "age" : { "from" : 10, "to" : 20 }
                    }
                },
                "should" : [
                    {
                        "term" : { "tag" : "sometag" }
                    },
                    {
                        "term" : { "tag" : "sometagtag" }
                    }
                ]
            }
        }
    }
}

exists, missing

Filters documents in which a specific field has a value (exists), or has no value (missing).

{
    "constant_score" : {
        "filter" : {
            "exists" : { "field" : "user" }
        }
    }
}
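
Only exists is shown above; the Python sketch below assumes that missing takes the same { "field" : ... } body, which is how I use it. The helper name field_presence_filter is made up:

```python
def field_presence_filter(field, present=True):
    """Build a constant_score query around an exists or a missing filter.

    Assumption: "missing" takes the same {"field": ...} body as "exists".
    """
    name = "exists" if present else "missing"
    return {"constant_score": {"filter": {name: {"field": field}}}}


has_user = field_presence_filter("user")
no_user = field_presence_filter("user", present=False)
```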

range, numeric_range

Unlike the range query, the range filter is cached; see the range query for the available parameters.

{
    "constant_score" : {
        "filter" : {
            "range" : {
                "age" : { 
                    "from" : "10", 
                    "to" : "20", 
                    "include_lower" : true, 
                    "include_upper" : false
                }
            }
        }
    }
}

The numeric_range filter loads the relevant field values into memory and checks the numeric range against them; this requires more memory, but may be significantly faster.

Unlike range, numeric_range filter results are not cached by default; set _cache to true to cache them. If the filter is reused, though, it is advisable to simply use the range filter.

query

Wraps a query to be used as a filter:

{
    "constant_score" : {
        "filter" : {
            "query" : { 
                "query_string" : { 
                    "query" : "this AND that OR thus"
                }
            }
        }
    }
}

This is not cached by default; to allow caching, use fquery (note that the format differs a bit from other filters):

{
    "constant_score" : {
        "filter" : {
            "fquery" : {
                "query" : { 
                    "query_string" : { 
                        "query" : "this AND that OR thus"
                    }
                },
                "_cache" : true
            }
        }
    }
}
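
Since the cached and the uncached form differ in both the key and the nesting, a tiny wrapper avoids mixing them up. A Python sketch based only on the two examples above; query_filter is a hypothetical helper name:

```python
def query_filter(query, cache=False):
    """Wrap a query so it can be used where a filter is expected."""
    if cache:
        # Cached form: "fquery" with the query nested next to "_cache".
        return {"fquery": {"query": query, "_cache": True}}
    # Uncached form: the query sits directly under "query".
    return {"query": query}


inner = {"query_string": {"query": "this AND that OR thus"}}
uncached = {"constant_score": {"filter": query_filter(inner)}}
cached = {"constant_score": {"filter": query_filter(inner, cache=True)}}
```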

term, terms

Filters documents that have fields containing a term (not analyzed); the result is cached by default:

{
    "constant_score" : {
        "filter" : {
            "term" : { "user" : "kimchy"}
        }
    }
}

The terms filter simply accepts multiple terms, matching documents containing any of the terms; if you want to match all of the terms, use execution=and:

{
    "constant_score" : {
        "filter" : {
            "terms" : {
                "user" : ["kimchy", "elasticsearch"],
                "execution" : "and"
            }
        }
    }
}
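
The any/all switch can be hidden behind a flag. A last Python sketch, with terms_filter as a made-up helper name; note that execution sits next to the field name inside terms, so this sketch would not work for a field that is itself called "execution":

```python
def terms_filter(field, terms, match_all=False):
    """Build a constant_score query around a terms filter.

    match_all=False matches documents containing any of the terms,
    match_all=True adds "execution": "and" to require all of them.
    """
    if field == "execution":
        raise ValueError("field name collides with the execution option")
    body = {field: list(terms)}
    if match_all:
        body["execution"] = "and"
    return {"constant_score": {"filter": {"terms": body}}}


any_of = terms_filter("user", ["kimchy", "elasticsearch"])
all_of = terms_filter("user", ["kimchy", "elasticsearch"], match_all=True)
```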