Small elasticsearch Notes
Revision as of 16:16, 22 January 2013
I was really surprised by elasticsearch (ES from here on): by the simplicity of its setup and configuration, and by its power and options.
Installation
Download, unpack, and run es/bin/elasticsearch. Yes, that's it. Amazing, isn't it?
What You Get
After the above 30-second setup, you have a search engine running on http://localhost:9200/, with automatic sharding (unlike many other systems, ES is always sharded, even on a single machine), replication, and much more.
A few highlights:
- ES sports a neat RESTful API that communicates (almost) entirely in JSON,
- ES is schemaless, unless you want it to be,
- you can hint ES on many tasks - e.g. what shards to search in,
- indices are created on the fly, no need to precreate them (this might make some bugs harder to find, but installing a new system is a breeze),
- you can search one index, a group of indices, or all of them, and restrict the search to particular document types,
- documents are versioned; also, indexing a document under an existing ID creates a new version rather than failing - this might or might not be what you want.
Indexing
Example
Let's start with an add-and-get example:
# let's add (type) _user_ to _twitter_ index, with ID _kimchy_
$ curl -XPUT 'http://localhost:9200/twitter/user/kimchy' -d '{ "name" : "Shay Banon" }'
> {"ok":true,"_index":"twitter","_type":"user","_id":"kimchy","_version":1}

$ curl -XGET 'http://localhost:9200/twitter/user/kimchy?pretty=true'
> {
>   "_index" : "twitter",
>   "_type" : "user",
>   "_id" : "kimchy",
>   "_version" : 1,
>   "exists" : true,
>   "_source" : { "name" : "Shay Banon" }
> }

# let's add one more _user_ to _twitter_ with the same ID
$ curl -XPUT 'http://localhost:9200/twitter/user/kimchy' -d '{ "name" : "Shay Baror" }'
> {"ok":true,"_index":"twitter","_type":"user","_id":"kimchy","_version":2}

# note the increase in version number
$ curl -XGET 'http://localhost:9200/twitter/user/kimchy?pretty=true'
> {
>   "_index" : "twitter",
>   "_type" : "user",
>   "_id" : "kimchy",
>   "_version" : 2,
>   "exists" : true,
>   "_source" : { "name" : "Shay Baror" }
> }

# now, let's search for "shay" users
$ curl -XGET 'http://localhost:9200/twitter/user/_search?q=name:shay&pretty=true'
> {
>   "took" : 491,
>   "timed_out" : false,
>   "_shards" : {
>     "total" : 5,
>     "successful" : 5,
>     "failed" : 0
>   },
>   "hits" : {
>     "total" : 1,
>     "max_score" : 0.625,
>     "hits" : [ {
>       "_index" : "twitter",
>       "_type" : "user",
>       "_id" : "kimchy",
>       "_score" : 0.625,
>       "_source" : { "name" : "Shay Baror" }
>     } ]
>   }
> }

# to search among all types in the _twitter_ index
$ curl -XGET 'http://localhost:9200/twitter/_search?q=name:shay'

# finally, you may search all indices
$ curl -XGET 'http://localhost:9200/_search?q=name:shay'

# or just selected indices - _twitter_ and _facebook_
$ curl -XGET 'http://localhost:9200/twitter,facebook/_search?q=name:shay'

# or on all indices starting with _t_, excluding _twitter_
Note that the responses contain most of the useful information and very little that is superfluous. Without the pretty=true parameter, you get the normal, more compact JSON.
Creating Documents
You create (index) a document by PUT'ing it to an index, under a type, with a document ID:
$ curl -XPUT 'http://localhost:9200/index/type/docid' -d '{"content":"trying out Elastic Search"}'
Versioning
Note that when a document is PUT more than once under the same ID, its version number is incremented rather than the write being rejected.
You can use optimistic concurrency control (OCC) in ES to make sure you're updating the version of the document you started from:
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?version=1' -d '{"name":"Shay Boron"}'
This PUT will only succeed if the current version of the document is still 1; the new document (with the updated name field) is then stored as version 2.
Note that if you're updating an old version (a newer version appeared in the meantime), you'll get an error:
{"error":"VersionConflictEngineException[[twitter][3] [user][kimchy]: version conflict, current [2], provided [1]]","status":409}
It's then up to you to fetch the new version, update your data accordingly, and PUT again.
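The get-update-retry cycle just described can be sketched as follows. This is an illustrative in-memory simulation of the optimistic-concurrency logic, not real ES client code: store, put, update_with_retry, and the document shapes are all made up for the example.

```python
class VersionConflict(Exception):
    """Raised when the stored version no longer matches the expected one."""

# a toy "index": ID -> (version, document)
store = {"1": (1, {"name": "Shay Banon"})}

def put(doc_id, doc, expected_version):
    """PUT with a version check - the core of optimistic concurrency control."""
    current_version, _ = store[doc_id]
    if current_version != expected_version:
        raise VersionConflict(f"current [{current_version}], provided [{expected_version}]")
    store[doc_id] = (current_version + 1, doc)

def update_with_retry(doc_id, change, max_retries=3):
    """GET the document, apply the change, PUT it back; retry on conflict."""
    for _ in range(max_retries):
        version, doc = store[doc_id]          # GET: the document plus its version
        updated = {**doc, **change}           # apply our modification
        try:
            put(doc_id, updated, version)     # PUT with the version we read
            return
        except VersionConflict:
            continue                          # someone else won the race - re-read and retry
    raise RuntimeError("gave up after repeated version conflicts")

update_with_retry("1", {"name": "Shay Boron"})
```

After the call, the store holds version 2 with the updated name; a concurrent writer would simply have forced one extra loop iteration.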
PUT-if-absent
To create a document only if it is not present in the index yet, use the create op type (the following 2 calls are equivalent):
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1?op_type=create' -d '{...}'
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1/_create' -d '{...}'
Automatic ID Generation
You can create a document without providing an ID. Note that the call is POST, not PUT! (This automatically sets op_type to create.)
curl -XPOST 'http://localhost:9200/twitter/tweet/' -d '{...}'
You'll receive the generated (UUIDv4) ID in the response:
{
  "ok" : true,
  "_index" : "twitter",
  "_type" : "tweet",
  "_id" : "6a8ca01c-7896-48e9-81cc-9f70661fcb32",
  "_version" : 1
}
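For reference, a version-4 UUID of the same shape can be generated with Python's standard library - a quick sketch of the ID format, not what ES runs internally:

```python
import uuid

# version-4 UUIDs are random except for two reserved fields:
# the version nibble (always 4) and the variant bits (8, 9, a, or b)
doc_id = str(uuid.uuid4())
print(doc_id)   # e.g. 6a8ca01c-7896-48e9-81cc-9f70661fcb32
```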
Indices and Types
An index is automatically created if it does not exist, and the data type mapping is also automatically created/updated.
Indices, as well as type mappings, can also be created "manually".
If you set action.auto_create_index to false in the configuration, indices need to be created manually before use. The same goes for type mappings - index.mapper.dynamic.
You can also whitelist/blacklist index names for automatic creation by setting action.auto_create_index to a pattern list such as +aaa*,-bbb*,+ccc*,-* (+ allows, - denies).
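One plausible reading of such a pattern list - patterns tried left to right, with the first match deciding - can be sketched like this (auto_create_allowed is a made-up helper for illustration, not an ES API, and the fall-through default is an assumption):

```python
from fnmatch import fnmatch

def auto_create_allowed(index_name, patterns):
    """Decide whether an index may be auto-created, given a pattern list
    like '+aaa*,-bbb*,+ccc*,-*'. Patterns are tried left to right;
    the first one that matches decides (+ allows, - denies)."""
    for pattern in patterns.split(","):
        allow, glob = pattern[0] == "+", pattern[1:]
        if fnmatch(index_name, glob):
            return allow
    return False  # nothing matched - be conservative (assumption)

patterns = "+aaa*,-bbb*,+ccc*,-*"
print(auto_create_allowed("aaa-logs", patterns))   # True  - matches +aaa*
print(auto_create_allowed("bbb-logs", patterns))   # False - matches -bbb*
print(auto_create_allowed("other", patterns))      # False - caught by the final -*
```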
Routing
By default, the shard used to store a document is selected using a hash of the document's ID.
You can "control" this by providing the value used for hashing yourself, using the routing parameter:
curl -XPOST 'http://localhost:9200/twitter/tweet?routing=kimchy' -d '{...}'
The main advantage of routing is that you can use the same value later when searching for documents:
curl -XGET 'http://localhost:9200/twitter/tweet/_search?routing=kimchy' -d '{"query":{...}}'
This restricts the search to only those shards that can hold documents with the given routing.
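The shard-selection rule can be sketched as follows; a toy deterministic hash stands in for ES's real one, and pick_shard is made up for the example:

```python
def pick_shard(doc_id, num_shards, routing=None):
    """Choose the shard for a document: hash the routing value if given,
    otherwise the document ID. (Illustrative - real ES uses its own hash.)"""
    key = routing if routing is not None else doc_id
    return sum(ord(c) for c in key) % num_shards   # toy deterministic hash

# all of kimchy's tweets land on the same shard, whatever their IDs:
shards = {pick_shard(doc_id, 5, routing="kimchy") for doc_id in ("1", "2", "3")}
print(shards)   # a single shard number
```

This is why a routed search can skip most shards: every document indexed with routing=kimchy is guaranteed to live on that one shard.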
Deleting
You can simply delete a document using DELETE HTTP method:
$ curl -XDELETE 'http://localhost:9200/twitter/tweet/1'
To make sure you're deleting the document you really want to delete, you can provide the version parameter; the delete will fail if a newer version of the document exists.
Note: If you used routing when creating the document, you need to provide it for deletion as well!
Note: Set the replication parameter to async if you want the delete executed asynchronously; the operation then returns as soon as the document is removed from the primary shard, instead of waiting for all replicas to be updated.
Getting
You can get a document from an index by type and ID:
curl -XGET 'http://localhost:9200/twitter/tweet/1'
{
  "_index" : "twitter",
  "_type" : "tweet",
  "_id" : "1",
  "_source" : {
    "user" : "kimchy",
    "postDate" : "2009-11-15T14:12:12",
    "message" : "trying out Elastic Search"
  }
}
To simply check if the document exists, use HEAD:
curl -XHEAD 'http://localhost:9200/twitter/tweet/1'
Get Options
The GET API is realtime - it is not affected by the refresh rate of the index (which controls when data becomes visible to search). You can disable this per call by setting the realtime parameter to false, or globally by setting action.get.realtime to false.
You can also specify the fields parameter to get only selected fields of the document. You may also fetch sub-objects, using the obj1.obj2 notation.
curl -XGET 'http://localhost:9200/twitter/tweet/1?fields=title,content'
The type of the document is optional; passing _all as the type will fetch the first document in the index that matches the ID, across all types.
The routing parameter is used as usual to specify shard explicitly.
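The fields / sub-object selection above can be pictured as walking a dotted path through the document source. An illustrative sketch - get_field is a made-up helper, not an ES API:

```python
def get_field(source, path):
    """Fetch a possibly nested field from a document source,
    using the dotted obj1.obj2 notation."""
    value = source
    for part in path.split("."):
        value = value[part]   # descend one level per path component
    return value

doc = {"title": "notes", "author": {"name": "kimchy", "karma": 42}}
print(get_field(doc, "title"))         # notes
print(get_field(doc, "author.name"))   # kimchy
```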
Multi Get
The _mget interface allows you to get multiple documents in one call, each identified by index, (optionally) type, and ID (and possibly routing).
From various indices
curl 'localhost:9200/_mget' -d '{
  "docs" : [
    { "_index" : "test", "_type" : "type", "_id" : "1" },
    { "_index" : "test", "_type" : "type", "_id" : "2" }
  ]
}'
From same index, various types
curl 'localhost:9200/test/_mget' -d '{
  "docs" : [
    { "_type" : "type", "_id" : "1" },
    { "_type" : "type", "_id" : "2" }
  ]
}'
Same index, same type
curl 'localhost:9200/test/type/_mget' -d '{ "ids" : ["1", "2"] }'
You can also specify fields to fetch:
curl 'localhost:9200/_mget' -d '{
  "docs" : [
    { "_index" : "test", "_type" : "type", "_id" : "1", "fields" : ["field1", "field2"] },
    { "_index" : "test", "_type" : "type", "_id" : "2", "fields" : ["field3", "field4"] }
  ]
}'
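The behaviour of _mget can be sketched with a toy in-memory index - illustrative Python, not ES code (store and mget are made up for the example). Note how, in this sketch, a missing document is flagged per entry instead of failing the whole request:

```python
# toy "index": (index, type, id) -> document
store = {
    ("test", "type", "1"): {"field1": "a", "field2": "b"},
}

def mget(docs):
    """Fetch several documents in one call; missing ones are flagged
    per entry rather than failing the whole request."""
    results = []
    for spec in docs:
        key = (spec["_index"], spec["_type"], spec["_id"])
        if key in store:
            results.append({**spec, "exists": True, "_source": store[key]})
        else:
            results.append({**spec, "exists": False})
    return results

out = mget([
    {"_index": "test", "_type": "type", "_id": "1"},
    {"_index": "test", "_type": "type", "_id": "2"},   # not in the store
])
print([d["exists"] for d in out])   # [True, False]
```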
Updating
The update operation gets the document, runs the script, and indexes the result back (the script may also choose to delete the document or to ignore the operation). Versioning is used to make sure no other update happened between the "get" and the "reindex".
Note: a full reindex of the document is still needed; the update API just cuts down on network round-trips and uses versioning to avoid conflicts.
Example
# create a document
curl -XPUT localhost:9200/test/type1/1 -d '{
  "counter" : 1,
  "tags" : ["red"]
}'

# update the counter - increment by 4
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.counter += count",
  "params" : {
    "count" : 4
  }
}'

# add a tag (might add a duplicate, since it's just a list!)
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.tags += tag",
  "params" : {
    "tag" : "blue"
  }
}'

# add a new field
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.text = \"some text\""
}'

# remove a field
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.remove(\"text\")"
}'

# delete the document if it has the tag 'blue', otherwise ignore (noop)
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.tags.contains(tag) ? ctx.op = \"delete\" : ctx.op = \"none\"",
  "params" : {
    "tag" : "blue"
  }
}'

# a partial document may also be POST'ed, causing the final document to be a merge of the original and the update
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "doc" : {
    "name" : "new_name"
  }
}'

# if the document does not exist, the content given as 'upsert' is indexed instead
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : "ctx._source.counter += count",
  "params" : {
    "count" : 4
  },
  "upsert" : {
    "counter" : 1
  }
}'
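What the update operation does internally can be pictured with a toy in-memory model of the get/script/reindex cycle - illustrative Python, not the actual ES implementation; a plain Python callable stands in for the server-side script:

```python
# toy "index": ID -> (version, source)
store = {"1": (1, {"counter": 1, "tags": ["red"]})}

def update(doc_id, script, params):
    """GET the document, run the script on a ctx-like object,
    then reindex (or delete) the result under a version check."""
    version, source = store[doc_id]
    ctx = {"op": "index", "_source": dict(source)}
    script(ctx, params)                          # run the "script"
    current_version, _ = store[doc_id]
    if current_version != version:
        raise RuntimeError("version conflict")   # a concurrent update slipped in
    if ctx["op"] == "delete":
        del store[doc_id]
    elif ctx["op"] == "index":
        store[doc_id] = (version + 1, ctx["_source"])
    # op == "none": ignore the operation

# the counter-increment example from above, as a Python callable
def increment_counter(ctx, params):
    ctx["_source"]["counter"] += params["count"]

update("1", increment_counter, {"count": 4})
print(store["1"])   # (2, {'counter': 5, 'tags': ['red']})
```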
Update Parameters
- routing - select routing/shard
- timeout - timeout waiting for a shard to become available
- replication - the replication type for the delete/index operation (sync or async)
- consistency - the write consistency of the index/delete operation
- percolate - enables percolation and controls which percolator queries will be executed
- refresh - refresh the index immediately after the operation, so that the updated document appears in search results right away (might increase indexing and network load)
- fields - return the relevant fields from the document updated; use _source to return the full updated source
- retry_on_conflict - how many times to retry if there is a version conflict; default 0