Difference between revisions of "ElasticSearch Search"

From PaskvilWiki

Latest revision as of 10:39, 24 January 2013

Official documentation: http://www.elasticsearch.org/guide/reference/api/search/

Search can be executed across indices and types, with query string as a parameter, or using a request body.

A search is broadcast to all shards of the index or indices. To limit the scope, use the routing parameter, e.g. the user ID when searching through tweets by a given user. The routing parameter may take multiple values, comma-separated.
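
As a minimal sketch of the routing parameter described above, the following Python helper only builds the search URL (it does not contact a server). The helper name routed_search_url and the second routing value "alice" are hypothetical; host, index and type are taken from the examples on this page.

```python
from urllib.parse import urlencode


def routed_search_url(base, index, doc_type, query, routing_values):
    """Build a URI search URL limited to the shards for the given routing values."""
    params = {
        "q": query,
        # Several routing values are sent as one comma-separated string.
        "routing": ",".join(routing_values),
    }
    # Keep ':' and ',' readable instead of percent-encoding them.
    return f"{base}/{index}/{doc_type}/_search?{urlencode(params, safe=':,')}"


url = routed_search_url(
    "http://localhost:9200", "twitter", "tweet",
    "user:kimchy", ["kimchy", "alice"],  # "alice" is a made-up second user ID
)
print(url)
```

The resulting URL can be passed to curl -XGET in the same way as the other examples here.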

See also highlighting (http://www.elasticsearch.org/guide/reference/api/search/highlighting.html), and other topics not covered here.

Request Body

Request body uses Query DSL.

$ curl -XGET 'http://localhost:9200/twitter/tweet/_search' -d '{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}'
{
    "_shards":{
        "total" : 5,
        "successful" : 5,
        "failed" : 0
    },
    "hits":{
        "total" : 1,
        "hits" : [
            {
                "_index" : "twitter",
                "_type" : "tweet",
                "_id" : "1", 
                "_source" : {
                    "user" : "kimchy",
                    "postDate" : "2009-11-15T14:12:12",
                    "message" : "trying out Elastic Search"
                }
            }
        ]
    }
}

Parameters

  • timeout - search timeout, bounding the time within which the search request must execute; defaults to no timeout,
  • from - index of the first hit to return; defaults to 0,
  • size - the number of hits to return; defaults to 10,
  • search_type - the type of search operation to perform: dfs_query_then_fetch, dfs_query_and_fetch, query_then_fetch, or query_and_fetch; defaults to query_then_fetch; see Search Type for more details on the different types of search.

URI Request

A search request can be executed purely using a URI by providing request parameters.

$ curl -XGET 'http://localhost:9200/twitter/tweet/_search?q=user:kimchy'

Parameters

  • q - query string (maps to the query_string query, see Query String Query for more details),
  • df - default field to use when no field prefix is defined,
  • default_operator - default operator to be used - AND or OR, default OR,
  • explain - include an explanation of how the score of each hit was computed,
  • fields - fields of the document to return, CSV; defaults to the internal _source field; an empty value causes no fields to be returned,
  • sort - sorting to perform: fieldName, fieldName:asc, or fieldName:desc; fieldName can be either an actual field or _score; several sort entries can be given (CSV, order is important),
  • track_scores - when sorting, set to true to return the score as part of each hit,
  • timeout - search timeout, limiting the execution time; all results accumulated up to the timeout are returned; defaults to no timeout,
  • from - index of the first hit to return; defaults to 0,
  • size - number of hits to return; defaults to 10,
  • lowercase_expanded_terms - should terms be automatically lowercased or not; default true,
  • analyze_wildcard - should wildcard and prefix queries be analyzed or not; default false.
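
The parameters above can be combined in one request. The sketch below assembles such a URL in Python without sending it; uri_search_url is a hypothetical helper, and passing several sort entries as one comma-separated value follows the description in the list above rather than anything verified against a server.

```python
from urllib.parse import urlencode


def uri_search_url(base, path, q, params=None):
    """Build a URI search URL from a query string and optional parameters."""
    query = {"q": q}
    for name, value in (params or {}).items():
        if isinstance(value, bool):
            # Booleans are written the way the URI API expects them.
            value = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            # Multi-valued parameters (fields, sort) are comma-separated.
            value = ",".join(value)
        query[name] = value
    return f"{base}/{path}/_search?{urlencode(query, safe=':,')}"


url = uri_search_url(
    "http://localhost:9200", "twitter/tweet", "user:kimchy",
    {"sort": ["post_date:desc", "_score"], "track_scores": True,
     "from": 20, "size": 10},
)
print(url)
```

The parameters are given as a dict rather than keyword arguments because from is a reserved word in Python.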

Query Element

See Query DSL for details.

{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Filter Element

With facet navigation, you often want only the hits to be filtered by the chosen facet, while the facets themselves continue to be calculated from the original query. The filter element within the search request does exactly this.

Note that this differs from a filtered query, where the filter also restricts the documents that the facets are computed on.

In other words, using

{
    "query" : { "term" : { "message" : "something" } },
    "filter" : { "term" : { "tag" : "green" } },
    "facets" : { "tag" : { "terms" : { "field" : "tag" } } }
}

the filter will not change the facets (the results of facets will be the same as without the filter element), while the results set will be different.

But using filtered query:

{
    "query" : {
        "filtered" : {
            "query" : { "term" : { "message" : "something" } },
            "filter" : { "term" : { "tag" : "green" } }
        }
    },
    "facets" : { "tag" : { "terms" : { "field" : "tag" } } }
}

the filter field within the filtered query element will change the facets, influencing both the results set and the facets.

To filter the facets, you can use facet_filter element.
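
To make the structural difference between the two requests explicit, here is a small Python sketch that builds both bodies from the same query, filter and facets. The function names are hypothetical; the field names and values are the ones used in the examples above.

```python
import json


def top_level_filter_body(query, filter_, facets):
    """Filter applied to the hits only; facets still see every query match."""
    return {"query": query, "filter": filter_, "facets": facets}


def filtered_query_body(query, filter_, facets):
    """Filter wrapped into the query; hits and facets are both restricted."""
    return {
        "query": {"filtered": {"query": query, "filter": filter_}},
        "facets": facets,
    }


query = {"term": {"message": "something"}}
tag_filter = {"term": {"tag": "green"}}
facets = {"tag": {"terms": {"field": "tag"}}}

hits_only = top_level_filter_body(query, tag_filter, facets)
hits_and_facets = filtered_query_body(query, tag_filter, facets)
print(json.dumps(hits_only, indent=4))
print(json.dumps(hits_and_facets, indent=4))
```

Either printed body can be sent with curl -d as in the Request Body example.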

From and Size, Pagination

{
    "from" : 0, "size" : 10,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
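
Pagination is a matter of computing from out of a page number. A minimal sketch, assuming 1-based page numbers (the helper page_body is hypothetical):

```python
def page_body(query, page, page_size=10):
    """Return a request body for the given 1-based page of results."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        # "from" is the index of the first hit, so page 1 starts at 0.
        "from": (page - 1) * page_size,
        "size": page_size,
        "query": query,
    }


body = page_body({"term": {"user": "kimchy"}}, page=3)
print(body["from"], body["size"])
```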

Indices and Types

  • specific index and type
$ curl -XGET 'http://localhost:9200/twitter/user/_search?q=user:kimchy'
  • specific index, multiple types
$ curl -XGET 'http://localhost:9200/twitter/user,tweet/_search?q=user:kimchy'
  • multiple indices, all types
$ curl -XGET 'http://localhost:9200/twitter,facebook/_search?q=user:kimchy'
  • all indices, specific type
$ curl -XGET 'http://localhost:9200/_all/tweet/_search?q=user:kimchy'
  • all indices and types
$ curl -XGET 'http://localhost:9200/_search?q=user:kimchy'

Sorting

Sorting is defined per field; the special field name _score sorts by score:

{
    "sort" : [
        { "post_date" : {"order" : "asc"} },
        "user",
        { "name" : "desc" },
        { "age" : "desc" },
        "_score"
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The sort values for each document returned are also returned as part of the response.
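
The URI form of sort (fieldName, fieldName:asc, fieldName:desc) maps directly onto the request body form shown above. The following sketch converts one into the other; sort_clause and sort_body are hypothetical helpers, not part of any ES client.

```python
def sort_clause(spec):
    """Convert 'field', 'field:asc' or 'field:desc' to a request body sort entry."""
    field, sep, order = spec.partition(":")
    if not sep:
        # A bare field name (or _score) is used as-is.
        return field
    if order not in ("asc", "desc"):
        raise ValueError(f"unknown sort order: {order!r}")
    return {field: {"order": order}}


def sort_body(query, specs):
    """Build a request body; the order of specs is the sort priority."""
    return {"sort": [sort_clause(s) for s in specs], "query": query}


body = sort_body({"term": {"user": "kimchy"}},
                 ["post_date:asc", "user", "name:desc", "_score"])
print(body["sort"])
```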

Missing Numeric Fields

Numeric fields support specific handling for missing fields in a doc. The missing value can be _last, _first, or a custom value (that will be used for missing docs as the sort value). For example:

{
    "sort" : [
        { "price" : {"missing" : "_last"} }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Missing Mapping for Field

By default, the search request will fail if there is no mapping associated with a field. The ignore_unmapped option makes the search ignore fields that have no mapping and skip sorting by them. Here is an example of how it can be used:

{
    "sort" : [
        { "price" : {"ignore_unmapped" : true} }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

GeoDistance

You can also sort by _geo_distance:

{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : [-70, 40],
                "order" : "asc",
                "unit" : "km"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

The _geo_distance pin may be provided as:

properties:         "pin.location" : { "lat" : 40, "lon" : -70 }
string (lat,lon):   "pin.location" : "40,-70"
geohash:            "pin.location" : "drm3btev3e86"
array ([lon, lat]): "pin.location" : [-70, 40]
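
The pin formats differ in coordinate order, which is easy to get wrong. The sketch below normalises the three non-geohash forms to a (lat, lon) pair, assuming the usual geo point conventions: the array is [lon, lat] and the string is "lat,lon". pin_to_lat_lon is a hypothetical helper, and geohash decoding is deliberately left out.

```python
def pin_to_lat_lon(pin):
    """Normalise a pin given as properties, array or 'lat,lon' string."""
    if isinstance(pin, dict):
        return float(pin["lat"]), float(pin["lon"])
    if isinstance(pin, (list, tuple)):
        # Arrays are ordered [lon, lat].
        lon, lat = pin
        return float(lat), float(lon)
    if isinstance(pin, str) and "," in pin:
        # Strings are ordered "lat,lon".
        lat, lon = pin.split(",")
        return float(lat), float(lon)
    raise ValueError("unsupported pin format (geohash is not decoded here)")


print(pin_to_lat_lon({"lat": 40, "lon": -70}))
print(pin_to_lat_lon([-70, 40]))
print(pin_to_lat_lon("40,-70"))
```

All three calls describe the same point, latitude 40 and longitude -70.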

Script Based Sorting

{
    "query" : {
        ....
    },
    "sort" : {
        "_script" : { 
            "script" : "doc['field_name'].value * factor",
            "type" : "number",
            "params" : {
                "factor" : 1.1
            },
            "order" : "asc"
        }
    }
}

Note: for single field based sorting, use custom_score query - it's faster.

Scores

When sorting on a field, scores are not computed. By setting track_scores to true, scores will still be computed and tracked.

{
    "track_scores": true,
    "sort" : [
        { "post_date" : {"reverse" : true} },
        { "name" : "desc" },
        { "age" : "desc" }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Note

Beware that all relevant fields used for sorting have to be loaded to memory.

When sorting on string fields, the field sorted on should not be analyzed/tokenized. For numeric types, it is recommended to set the type explicitly (short, integer, float, ...).

Fields

By default, ES loads the internal _source field.

When fields are requested, stored fields (store mapping set to yes) are loaded directly; a field that is not stored is extracted from the loaded _source instead (which also allows nested document objects to be returned).

The * can be used to load all stored fields from the document.

An empty array will cause only the _id and _type for each hit to be returned, for example:

{
    "fields" : [],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Partial

You can also specify only parts of the document to be included or excluded from loading, using partial_fields (each of them supports multiple patterns):

{
    "query" : {
        "match_all" : {}
    },
    "partial_fields" : {
        "partial1" : {
            "include" : ["obj1.obj2.*", "obj1.obj4.*"],
            "exclude" : "obj1.obj3.*"
        }
    }
}

Script Fields

ES can return fields created by script evaluation:

{
    "query" : {
        ...
    },
    "script_fields" : {
        "test1" : {
            "script" : "doc['my_field_name'].value * 2"
        },
        "test2" : {
            "script" : "doc['my_field_name'].value * factor",
            "params" : {
                "factor"  : 2.0
            }
        },
        "test3" : {
            "script" : "_source.obj1.obj2" 
        }
    }
}

It's important to understand the difference between doc['my_field'].value and _source.my_field. The first, using the doc keyword, causes the terms for that field to be loaded into memory (cached), which results in faster execution but higher memory consumption. Also, the doc[...] notation only supports simple valued fields (it can't return a JSON object) and makes sense only on non-analyzed or single-term fields. The _source, on the other hand, causes the source to be loaded and parsed, and then only the relevant part of the JSON is returned.

Facets

The field used for facet calculations must be of type numeric, date/time or be analyzed as a single token — see the Mapping guide for details on the analysis process.

You can give the facet a custom name and return multiple facets in one request.

TODO