This article explains how to rebuild an index locally with the Reindex API. The approach is simple, fast, and practical; let's walk through how it works.
Rebuild the index locally
_reindex does not attempt to set up the target index, and it does not copy the settings of the source index. You should create and configure the target index before running _reindex, including its mappings, number of shards, replicas, and so on.
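For example, a minimal sketch of creating the new_twitter destination index used below might look like this (the shard and replica counts, the tweet mapping type, and its fields are illustrative assumptions based on this article's examples, not values from the original text):

PUT new_twitter
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  },
  "mappings": {
    "tweet": {
      "properties": {
        "user": { "type": "keyword" },
        "tweet": { "type": "text" }
      }
    }
  }
}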
The most basic form of _reindex simply copies documents from one index to another. The following copies documents from the twitter index into the new_twitter index:
POST _ reindex {"source": {"index": "twitter"}, "dest": {"index": "new_twitter"} this will return information similar to {"took": 147, "timed_out": false, "created": 120, "updated": 0, "deleted": 0, "batches": 1, "version_conflicts": 0, "noops": 0 "retries": {"bulk": 0, "search": 0}, "throttled_millis": 0, "requests_per_second":-1.0, "throttled_until_millis": 0, "total": 120, "failures": []}
Like _update_by_query, _reindex takes a snapshot of the source index, but because its target must be a different index, version conflicts are unlikely. The dest element can be configured like the Index API to control optimistic concurrency. Omitting version_type (as above) or setting it to internal causes Elasticsearch to blindly dump documents into the target, overwriting anything with the same type and ID:
POST _ reindex {"source": {"index": "twitter"}, "dest": {"index": "new_twitter", "version_type": "internal"}} setting version_type to external will cause Elasticsearch to retain the version from the source file and create all missing documents And update all documents that are older in the target index than in the source index: POST _ reindex {"source": {"index": "twitter"}, "dest": {"index": "new_twitter", "version_type": "external"}} setting op_type to create will cause _ reindex to create only missing documents in the target index. All existing documents will cause version conflicts: POST _ reindex {"source": {"index": "twitter"}, "dest": {"index": "new_twitter", "op_type": "create"}} result: {"took": 2015, "timed_out": false, "total": 6520, "updated": 0, "created": 885, "deleted": 0, "batches": 1 "version_conflicts": 115,115 "noops": 0, "retries": {"bulk": 0, "search": 0}, "throttled_millis": 0, "requests_per_second":-1, "throttled_until_millis": 0, "failures": [{"index": "sphinx-doctorinfo-20.11.11-162930", "type": "_ doc", "id": "42" "cause": {"type": "version_conflict_engine_exception", "reason": "[_ doc] [42]: version conflict, document already exists (current version [1])", "index_uuid": "z1U5C2-TSXWQtAofQSSuHg", "shard": "0", "index": "sphinx-doctorinfo-20.11.11-162930"} "status": 409}} default Version conflicts will abort the _ reindex process, but you can count in the event of a conflict by setting "conflict": "proceed" in the request body: POST _ reindex {"conflicts": "proceed", "source": {"index": "twitter"}, "dest": {"index": "new_twitter", "op_type": "create"}} you can restrict the document by adding type or query to source. The following will copy the tweet published by kimchy into new_twitter: POST _ reindex {"source": {"index": "twitter", "type": "tweet", "query": {"term": {"user": "kimchy"}}, "dest": {"index": "new_twitter"}}
Both index and type in source can be lists, allowing you to copy from many sources in a single request. The following copies documents from the tweet and post types in the twitter and blog indexes. It also includes the post type in the twitter index and the tweet type in the blog index; if you want to be more specific, you will need to use a query. _reindex also makes no effort to handle ID conflicts: the target index will remain valid, but because the iteration order is not well defined, it is hard to predict which document will be preserved.
POST _ reindex {"source": {"index": ["twitter", "blog"], "type": ["tweet", "post"]}, "dest": {"index": "all_together"}}
You can also limit the number of documents processed by setting size. The following copies just a single document from twitter into new_twitter:
POST _ reindex {"size": 1, "source": {"index": "twitter"}, "dest": {"index": "new_twitter"}}
If you want to get a specific set of documents from the twitter index, you need to sort. Sorting makes scrolling less efficient, but in some cases it is worth it. If possible, prefer a more selective query over size and sort. The following copies 10000 documents from twitter into new_twitter:
POST _ reindex {"size": 10000, "source": {"index": "twitter", "sort": {"date": "desc"}}, "dest": {"index": "new_twitter"} source section supports all elements supported in the search request. For example, use only some fields of the original document, and use the source filter as follows: POST _ reindex {"source": {"index": "twitter", "_ source": ["user", "tweet"]}, "dest": {"index": "new_twitter"}}
Like _update_by_query, _reindex supports scripts that modify the document. Unlike _update_by_query, the script is also allowed to modify the document's metadata. This example bumps the version of the source document:
POST _ reindex {"source": {"index": "twitter"}, "dest": {"index": "new_twitter", "version_type": "external"}, "script": {"inline": "if (ctx._source.foo = = 'bar') {ctx._version++" Ctx._source.remove ('foo')} "," lang ":" painless "}} just like in _ update_by_query, you can set ctx.op to change the action performed on the target index: noop if your script decides that no changes are necessary, set ctx.op =" noop ". This will cause _ update_by_query to ignore the document from its updates. This no operation will be reported on the noop counter of the response body. Delete if your script decides that the document must be deleted, set ctx.op= "delete". Deletions will be reported in the deleted counter of the response body. Setting ctx.op to anything else is an error. It is an error to set any other fields in ctx. Think about the possibilities! Just be careful, there's a lot of power. You can change: _ id_type_index_version_routing_parent sets _ version to null or removes from the ctx mapping as if the version was not sent in an index request. This will cause the document in the target index to be overwritten, regardless of the target version or the type of version used in the _ reindex request.
By default, if _reindex sees a document with routing, the routing is preserved unless the script changes it. You can set routing on the dest request to change this:
Keep: sets the routing on the bulk request sent for each match to the routing of the match. This is the default.
Discard: sets the routing on the bulk request sent for each match to null.
=<some text>: sets the routing on the bulk request sent for each match to the text after the `=`.
For example, you can use the following request to copy all documents from the source index with company name cat into the dest index with routing set to cat:
POST _ reindex {"source": {"index": "source", "query": {"match": {"company": "cat"}}, "dest": {"index": "dest", "routing": "= cat"}
By default, _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element:
POST _ reindex {"source": {"index": "source", "size": 100 # batch size where machine resources permit, make it bigger}, "dest": {"index": "dest", "routing": "= cat"} 1.ES is not real-time. In Elasticsearch, the real-time performance of Index is controlled by refresh. The default is 1s, and you can get to 100ms as soon as possible, which means that after Index doc is successful, you need to wait a second before it can be searched. 2.reindex consumes performance. With the help of: scroll+bulk. Optimization suggestion: it is recommended to set a large point for this size under re-indexing.
Reindex can also use the Ingest Node feature by specifying a pipeline, like this:
POST _ reindex {"source": {"index": "source"}, "dest": {"index": "dest", "pipeline": "some_ingest_pipeline"}} remotely rebuild the index
Reindex supports re-indexing from a remote Elasticsearch cluster:
POST _ reindex {"source": {"host": "http://otherhost:9200"," username ":" user "," password ":" pass "}," index ":" source "," query ": {" match ": {" test ":" data "} "dest": {"index": "dest"}}
The host parameter must include scheme, host, and port (for example, https://otherhost:9200). The username and password parameters are optional; when present, _reindex connects to the remote Elasticsearch node using basic auth. Be sure to use https when using basic auth, or the password will be sent in plain text.
Remote hosts must be explicitly whitelisted in elasticsearch.yml using the reindex.remote.whitelist property. It can be set to a comma-separated list of allowed remote host and port combinations (for example, otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*). The whitelist ignores the scheme; only the host and port are used.
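As a sketch, the corresponding entry in elasticsearch.yml might look like the following, reusing the placeholder hosts from above (this is a static setting, so each node must be restarted for it to take effect):

reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"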
This feature should work with remote clusters of any version of Elasticsearch you are likely to encounter, which should allow you to upgrade from any version of Elasticsearch to the current version by reindexing from a cluster running the old version.
To enable queries against older versions of Elasticsearch, the query parameter is sent directly to the remote host without validation or modification.
Reindexing from a remote server uses an on-heap buffer with a default maximum size of 100mb. If the remote index contains very large documents, you need to use a smaller batch size. The following example sets a very small batch size of 10:
POST _ reindex {"source": {"remote": {"host": "http://otherhost:9200"},"index": "source", "size": 10, "query": {"match": {"test": "data"}}, "dest": {"index": "dest"}}
You can also use the socket_timeout field to set the socket read timeout on a remote connection and the connect_timeout field to set the connection timeout. The default for both is 30 seconds. This example sets the socket read timeout to one minute and the connection timeout to ten seconds:
POST _ reindex {"source": {"host": "http://otherhost:9200"," socket_timeout ":" 1m "," connect_timeout ":" 10s "}," index ":" source "," query ": {" match ": {" test ":" data "} "dest": {"index": "dest"} Reindex API1. URL parameter
In addition to standard parameters like pretty, the Reindex API supports refresh, wait_for_completion, wait_for_active_shards, timeout, and requests_per_second.
Sending refresh causes all shards of the index that was written to to be refreshed when the request completes. This is different from the Index API's refresh parameter, which causes only the shard that received the new data to be refreshed.
If the request contains wait_for_completion=false, Elasticsearch performs some preflight checks, launches the request, and then returns a task that can be used with the Tasks API to cancel the task or get its status. Elasticsearch also creates a record of this task as a document at .tasks/task/${taskId}. You can keep or delete it as you see fit; when you are done with it, delete it so that Elasticsearch can reclaim the space it uses.
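As a sketch, reusing the node:id style of task id that appears in the Task API example later in this article (the id itself is illustrative, and the exact document path can vary by Elasticsearch version), inspecting and then cleaning up that record might look like this:

GET .tasks/task/r1A2WoRbTwKZ516z6NEs5A:36619
DELETE .tasks/task/r1A2WoRbTwKZ516z6NEs5A:36619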
Wait_for_active_shards controls how many copies of a shard must be active before the request proceeds. Timeout controls how long each write request waits for unavailable shards to become available. Both work exactly as they do in the Bulk API.
Requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc.) to throttle the rate at which _reindex issues batches of index operations, or to -1 to disable throttling. The throttling is done by waiting between batches so that the scroll timeout can be managed. The wait time is the difference between the time the batch took to complete and the time requests_per_second * requests_in_the_batch would allow. Since the batch is not broken into multiple bulk requests, large batch sizes cause Elasticsearch to create many requests and then wait a while before starting the next set. This is "bursty" rather than "smooth". The default is -1.
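Putting a few of these parameters together, a sketch of launching the twitter example from earlier as an asynchronous, throttled reindex might look like this (the requests_per_second value is an arbitrary illustration):

POST _reindex?wait_for_completion=false&requests_per_second=500
{
  "source": { "index": "twitter" },
  "dest": { "index": "new_twitter" }
}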
2. Response body
The JSON response is similar to the following:
{"took": 639, "updated": 0, "created": 123, "batches": 1, "version_conflicts": 2, "retries": {"bulk": 0, "search": 0} "throttled_millis": 0, "failures": []}
Took: the number of milliseconds from the beginning to the end of the entire operation.
Updated: the number of documents updated successfully.
Created: the number of documents successfully created.
Batches: the number of scroll responses pulled back by the reindex.
Version_conflicts: the number of version conflicts the reindex hit.
Retries: the number of retries attempted by the reindex; bulk is the number of bulk actions retried, and search is the number of search actions retried.
Throttled_millis: the number of milliseconds the request slept to conform to requests_per_second.
Failures: an array of failures. If this is non-empty, the request aborted because of those failures. See conflicts for how to prevent version conflicts from aborting the operation.
3. Use with Task API
You can use the Task API to get the status of any running reindex request:
GET _tasks?detailed=true&actions=*reindex

The response will be similar to the following:

{
  "nodes": {
    "r1A2WoRbTwKZ516z6NEs5A": {
      "name": "r1A2WoR",
      "transport_address": "127.0.0.1:9300",
      "host": "127.0.0.1",
      "ip": "127.0.0.1:9300",
      "attributes": {
        "testattr": "test",
        "portsfile": "true"
      },
      "tasks": {
        "r1A2WoRbTwKZ516z6NEs5A:36619": {
          "node": "r1A2WoRbTwKZ516z6NEs5A",
          "id": 36619,
          "type": "transport",
          "action": "indices:data/write/reindex",
          "status": {    // ①
            "total": 6154,
            "updated": 3500,
            "created": 0,
            "deleted": 0,
            "batches": 4,
            "version_conflicts": 0,
            "noops": 0,
            "retries": { "bulk": 0, "search": 0 },
            "throttled_millis": 0
          },
          "description": ""
        }
      }
    }
  }
}
① This object contains the actual status. It is just like the response JSON, with the addition of the important total field. Total is the total number of operations the reindex expects to perform. You can estimate progress by adding up the updated, created, and deleted fields; the request will finish when their sum equals the total field. For example, with updated 3500 and created and deleted both 0 against a total of 6154, the task above is roughly 57% complete.
Using the task id, you can find the task directly:
GET /_tasks/taskId:1
The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is complete and wait_for_completion=false was set on it, it returns a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}; it is up to you to delete that document.
4. Use with Cancel Task API
Any reindex can be canceled using the Cancel Task API:
POST _tasks/task_id:1/_cancel

The task_id can be found using the Task API above.
Cancellation should happen quickly, but may take a few seconds. The Task Status API above will continue to list the task until it wakes up and cancels itself.
5. Reset the throttle
The value of requests_per_second can be changed on a running reindex using the _rethrottle API:
POST _reindex/task_id:1/_rethrottle?requests_per_second=-1

The task_id can be found using the Task API above.
Just as when setting it on the _reindex API, requests_per_second can be -1 to disable throttling, or any decimal number, like 1.7 or 12, to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query takes effect only after completing the current batch. This prevents scroll timeouts.
6. Modify field names
_reindex can be used to build a copy of an index with renamed fields. Suppose you create an index containing the following document:
POST test/test/1?refresh
{
  "text": "words words",
  "flag": "foo"
}

But you don't like the name flag and want to replace it with tag. _reindex can create the other index for you:

POST _reindex
{
  "source": { "index": "test" },
  "dest": { "index": "test2" },
  "script": {
    "inline": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}

Now you can fetch the new document:

GET test2/test/1

{
  "found": true,
  "_id": "1",
  "_index": "test2",
  "_type": "test",
  "_version": 1,
  "_source": {
    "text": "words words",
    "tag": "foo"
  }
}

Or you can search by tag however you like.

7. Manual slicing
Reindex supports scroll slicing, which lets you parallelize the process manually relatively easily:
POST _ reindex {"source": {"index": "twitter", "slice": {"id": 0, "max": 2}}, "dest": {"index": "new_twitter"}} POST _ reindex {"source": {"index": "twitter", "slice": {"id": 1, "max": 2}} "dest": {"index": "new_twitter"}} you can verify that the result of GET _ refreshPOST new_twitter/_search?size=0&filter_path=hits.total is a reasonable total like this: {"hits": {"total": 120}} 8. Automatic slicing
You can also let _reindex slice on _uid automatically and scroll the slices in parallel:
POST _reindex?slices=5&refresh
{
  "source": { "index": "twitter" },
  "dest": { "index": "new_twitter" }
}

You can verify with:

POST new_twitter/_search?size=0&filter_path=hits.total

which yields a reasonable total like this:

{
  "hits": {
    "total": 120
  }
}
Adding slices to _reindex automates the manual process shown in the section above, creating the sub-requests for you, which means it has some quirks:
You can see these requests in the Task API; they are child tasks of the sliced request's task.
Fetching the status of the task for the sliced request returns only the status of completed slices.
These sub-requests are individually addressable, for example for cancellation and rethrottling (see the sketch after this list).
Rethrottling the sliced request will proportionally rethrottle its unfinished sub-requests.
Canceling the sliced request will cancel each of its sub-requests.
Due to the nature of slices, each sub-request won't get a perfectly even portion of the documents. All documents will be processed, but some slices may be larger than others; expect larger slices to be more evenly distributed.
Parameters like requests_per_second and size on a sliced request are distributed proportionally to each sub-request. Combined with the uneven distribution above, you should conclude that using size with slices may not result in exactly size documents being reindexed.
Each sub-request gets a slightly different snapshot of the source index, though these are all taken at approximately the same time.
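As a sketch of the addressability mentioned in the list above, the child tasks of a sliced request can be listed with the Task API's parent_task_id filter (the parent task id here is illustrative):

GET _tasks?parent_task_id=r1A2WoRbTwKZ516z6NEs5A:36619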
9. Select the number of slices
In this section we offer some suggestions about the number of slices to use (i.e., the max parameter of slice when parallelizing manually):
Do not use large numbers; 500, for example, causes considerable CPU churn.
From a query performance standpoint, it is more efficient to use some multiple of the number of shards in the source index.
From a query performance standpoint, using exactly as many slices as there are shards in the source index is most efficient.
Indexing performance should scale linearly with the number of slices across the available resources.
Whether indexing or query performance dominates the process depends on many factors, such as the documents being reindexed and the cluster doing the reindexing.
10. Rebuild daily indices
You can use _reindex in combination with Painless to reindex daily indices and apply a new template to existing documents. Suppose you have indices with the following documents:
PUT metricbeat-2016.05.30/beat/1?refresh
{ "system.cpu.idle.pct": 0.908 }

PUT metricbeat-2016.05.31/beat/1?refresh
{ "system.cpu.idle.pct": 0.105 }
The new template for the metricbeat-* indices has already been loaded into Elasticsearch, but it applies only to newly created indices. Painless can be used to reindex the existing documents and apply the new template.
The script below extracts the date from the index name and creates a new index name with -1 appended. All data from metricbeat-2016.05.31 will be reindexed into metricbeat-2016.05.31-1:
POST _ reindex {"source": {"index": "metricbeat-*"}, "dest": {"index": "metricbeat"}, "script": {"lang": "painless", "inline": "ctx._index = 'metricbeat-' + (ctx._index.substring (' metricbeat-'.length (), ctx._index.length ()) +'- 1'"}
All documents from the previous metricbeat indices can now be found in the *-1 indices:
GET metricbeat-2016.05.30-1/beat/1
GET metricbeat-2016.05.31-1/beat/1
The previous method can also be used in conjunction with changing a field's name to load existing data into a new index, renaming fields where necessary.
11. Extract a random subset of the index
Reindex can be used to extract a random subset of an index for testing:
POST _ reindex {"size": 10, "source": {"index": "twitter", "query": {"function_score": {"query": {"match_all": {}}, "random_score": {}}, "sort": "_ score" / / ①} "dest": {"index": "random_twitter"}}
① Reindex sorts by _doc by default, so random_score has no effect unless you override the sort to _score.
At this point, I believe you have a deeper understanding of how the Reindex API rebuilds an index locally. You might as well try it out in practice.