This article explains how shard allocation recovery works for Elasticsearch indices. The ideas involved are simple and practical, so let's walk through what recovery does and how to configure it.
What is recovery?
In Elasticsearch, recovery refers to the process of allocating the shards of an index to nodes. It usually happens when a snapshot is restored, when the number of replica shards is changed, or when a node fails or restarts. Because the master node holds the state of the entire cluster, it can determine which shards need to be reallocated and to which nodes, for example:
If a primary shard exists but the node holding its replica shard dies, the master selects another available node, allocates a new replica shard for that primary there, and then copies the data from the primary to the new replica.
If the node holding a primary shard dies but its replica shard survives, the master promotes the replica to primary and then replicates data from the new primary to a new replica.
If both the primary and replica copies of a shard are lost, the shard cannot be recovered for the time being; the master can only resume the recovery operation once a node holding the relevant data rejoins the cluster.
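To see these reallocations in practice, you can list the shard states while a node is down. This is a minimal sketch using the cat shards API; the index name my_index is only a placeholder:

GET _cat/shards/my_index?v
# Typical output columns: index, shard, prirep (p = primary, r = replica), state, docs, store, node.
# While recovery is in progress you will see shards in INITIALIZING or RELOCATING state,
# and UNASSIGNED for shards whose copies are still missing.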
However, the recovery process consumes extra resources such as CPU, memory, and network bandwidth between nodes. It can degrade the cluster's service performance and even make some functionality temporarily unavailable, so it is worth understanding the recovery process and its related configuration in order to reduce unnecessary cost and problems.
Reducing data copying caused by a full cluster restart
Sometimes you have to restart the whole Elasticsearch cluster, for example for a hardware upgrade or after an unavoidable outage. A full restart brings a problem: some nodes come up earlier than others and elect a master node, and as soon as a master exists it immediately starts driving the recovery process.
At that point, however, the cluster's data is not yet complete, because other nodes have not started. Suppose, for example, that node B, which holds the replica shards for the primary shards on node A, is not up yet. The master will re-copy the primary shards on node A that have no replicas to an available node C. When node B finally comes up, it finds that the replica shards it stores for node A's primaries already exist on node C, so it simply deletes the now "invalid" data on its own node (the replicas of node A's shards). This situation can easily occur repeatedly in clusters with many nodes.
When the whole cluster has come back up, the data distribution across nodes is clearly uneven (the nodes that started first recovered the data, and the late nodes deleted their invalid copies). The master then triggers a Rebalance process to move data between nodes, which consumes a lot of network traffic. We therefore need to set the recovery-related parameters sensibly to optimize the recovery process.
During cluster startup, run the recovery process only once this many nodes have started successfully; both master-eligible nodes and data nodes count:
gateway.expected_nodes: 3
Run the recovery process only once this many master-eligible nodes have started successfully:
gateway.expected_master_nodes: 3
Run the recovery process only once this many data nodes have started successfully:
gateway.expected_data_nodes: 3
If the expected number of nodes has not yet been reached, the recovery process waits for the time specified by gateway.recover_after_time. Once the wait times out, whether to execute recovery is decided by the following conditions:
gateway.recover_after_nodes: 3          # 3 nodes have started (master-eligible and data nodes both count)
gateway.recover_after_master_nodes: 3   # 3 master-eligible nodes have started successfully
gateway.recover_after_data_nodes: 3     # 3 data nodes have started successfully
If any one of the three conditions above is met, the recovery process is executed.
For example, with a cluster configured as follows:
gateway.expected_data_nodes: 10
gateway.recover_after_time: 5m
gateway.recover_after_data_nodes: 8
the recovery process starts as soon as 10 data nodes have joined the cluster within 5 minutes, or after 5 minutes if at least 8 data nodes have joined.
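These gateway settings are static settings placed in elasticsearch.yml; they cannot be changed through the cluster settings API. During a full restart you can watch nodes joining and see whether recovery has been delayed or triggered with a couple of cat APIs; a minimal sketch:

GET _cat/nodes?v    # how many master-eligible / data nodes have joined so far
GET _cat/health?v   # cluster status stays red until recovery has run and the primary shards are assigned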
Reducing data replication between primary and replica shards
Restarting a single node rather than the whole cluster can also cause shard data to be copied between nodes. To avoid this, you can disable shard allocation in the cluster before restarting:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}
Once the node has restarted, re-enable it:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}
This way, after the node restarts, as much data as possible can be recovered directly from that node. However, before Elasticsearch 1.6, a large amount of data was still copied between primary and replica shards even with the measures above. At first glance this is puzzling: the primary and replica data are identical, so after a reboot the data should simply be recovered from the copy already on the node. Why copy it again from the primary? The reason is that recovery simply compares the segment files of the primary and the replica to decide which data is consistent and can be recovered locally, and which is inconsistent and must be re-copied. Segment merging, however, runs completely independently on each node and may have progressed to different depths, so the segment files can differ even when the document sets are exactly the same.
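You can see this effect for yourself with the segments API, which reports the segment files of each shard copy. A minimal sketch, with my_index as a placeholder index name:

GET /my_index/_segments
# The response lists, for each shard copy on each node, the segment names, sizes,
# and document counts. Even with identical documents, the primary and replica often
# show different segment layouts, because merging runs independently on each node.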
To solve this problem, Elasticsearch 1.6 introduced the synced flush feature. A shard that has not received updates for 5 minutes is automatically synced-flushed, which in effect writes a sync id to the shard's copies. After a node restarts, the sync ids of the primary and replica can be compared first: if they match, the two copies are identical, and unnecessary segment file copying is avoided.
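A synced flush can also be triggered by hand. This is a minimal sketch; my_index is only a placeholder, and note that synced flush was deprecated in Elasticsearch 7.6 and removed in 8.0, where a plain flush achieves the same effect:

POST /my_index/_flush/synced   # synced flush a single index
POST /_flush/synced            # or synced flush all indices
# The response reports, per index, how many shard copies the sync id was successfully
# written to ("successful") and which copies failed, e.g. because they still had
# pending operations.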
Note that synced flush only helps cold indices; it does nothing for hot indices (indices updated within the last 5 minutes). If the rebooted node holds hot indices, there will still be a large amount of copying. If you need to restart a node that holds many hot indices, you can follow the steps below so that the recovery process completes almost instantly:
Pause data writing
Turn off the shard allocation of the cluster
Manually execute POST /_flush/synced
Restart the node
Re-enable shard allocation in the cluster
Wait for recovery to complete, i.e. until the cluster health status is green
Resume data writes
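Put together as API calls, the middle steps of this procedure might look like the following sketch; pausing and resuming writes depends on your own ingestion pipeline and is not shown:

# 1. Disable shard allocation
PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "none" } }

# 2. Synced flush (a plain POST /_flush on 7.6+ and 8.x)
POST /_flush/synced

# 3. Restart the node, then re-enable shard allocation
PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "all" } }

# 4. Wait for the cluster to return to green
GET _cluster/health?wait_for_status=green&timeout=5m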
Why recovery of very large hot indices is slow
For cold indices, since the data is no longer being updated (not updated for 5 minutes already counts as a long time for Elasticsearch), synced flush lets the data be recovered quickly from the local copies. For hot indices, and especially those with large shards, synced flush does not help, so a large number of segment files have to be copied across nodes; beyond that, translog recovery may be an even more important cause of slowness.
So what exactly is translog recovery?
When a node restarts, recovering data from the primary shard to the replica shard goes through three stages:
In the first stage, a snapshot is taken of the segment files on the primary shard and copied to the node holding the replica shard. Index requests are not blocked during this copy; new index operations are recorded in the translog (think of it as a temporary log file).
In the second stage, a snapshot is taken of the translog, which contains the index requests that arrived during the first stage, and the index operations in that snapshot are replayed on the replica. Index requests are still not blocked at this stage, and new index operations keep being recorded in the translog.
In the third stage, to bring the primary and replica fully in sync, new index requests are blocked and the translog operations accumulated during the previous stage are replayed.
It follows that the translog cannot be cleared until the recovery process completes. If the shard is large, the first stage takes a long time, which means a large translog accumulates during it; replaying a translog takes much longer than a plain file copy, so the second stage also gets significantly longer. By the third stage there may be even more translog to replay than in the second. Crucially, the third stage blocks new index (write) requests, which degrades performance and hurts the user experience when writes have strict real-time requirements. So to speed up recovery of very hot indices, it is best to follow the approach from the previous section:
Pause data writes.
Run a manual synced flush.
Wait for data recovery to complete.
Resume data writes.
This minimizes the impact on data latency.
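While a recovery is running, you can watch which stage each shard copy is in and how much of the translog has been replayed. A minimal sketch with the cat recovery API (output columns vary slightly between versions):

GET _cat/recovery?v&active_only=true
# Typical columns include the recovery stage (e.g. index, translog, finalize, done),
# bytes_percent for the segment file copy, and translog_ops_percent for how much of
# the translog has been replayed so far.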
At this point you should have a deeper understanding of how shard allocation recovery works for Elasticsearch indices. Give it a try in practice.