This article explains how Elasticsearch allocates index shards during recovery, and how to configure the process so that a cluster recovers faster and with less overhead.
Basic concepts
In Elasticsearch, recovery refers to the process of allocating an index shard to a node and rebuilding its data there. It typically happens when a snapshot is restored, when the number of index replicas changes, when a node fails, or when a node restarts. Because the master holds the state of the entire cluster, it can decide which shards need to be reallocated and to which nodes. For example:
If the node holding a replica shard dies, the master selects another available node, allocates a new replica there, and copies the data over from the primary shard.
If the node holding the primary shard dies while a replica survives, the replica is promoted to primary, and a new replica is then copied from it.
If both the primary and all replicas of a shard are lost, the shard cannot be recovered for the time being. The cluster waits for a node holding the relevant data to rejoin, restores the primary shard from that node, and then selects another node to host a new replica.
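To see where the primary and replica copies of each shard currently live, the cat shards API is handy (a minimal sketch, assuming a node reachable on localhost:9200):

curl -XGET 'http://localhost:9200/_cat/shards?v'

The output lists each shard with its index, shard number, primary/replica flag, state (STARTED, INITIALIZING, RELOCATING, UNASSIGNED) and the node it is on.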
Normally, we can check the health and data integrity of the whole cluster through ES's cluster health API:
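For example (assuming the cluster is reachable on localhost:9200):

curl -XGET 'http://localhost:9200/_cluster/health?pretty'

The response contains the overall status field described below, along with counters such as initializing_shards, relocating_shards and unassigned_shards.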
The possible values of status and their meanings:
Green: every shard has its primary and all replicas allocated and healthy.
Yellow: all primary shards are allocated, but some replica shards are missing or not yet allocated; the data is still complete.
Red: some shards have lost both their primary and all replicas; the corresponding index data is incomplete.
The recovery process consumes extra resources such as CPU, memory, and network bandwidth between nodes. This overhead can degrade the cluster's service performance or make some features temporarily unavailable. Understanding the recovery process and its configuration parameters is very helpful for reducing that overhead and speeding up cluster recovery.
Reducing unnecessary data copying during a full cluster restart
An ES cluster sometimes needs a full restart, for example to upgrade hardware, the operating system, or a major ES version. Restarting all nodes can cause a problem: some nodes join the cluster before others, and those early joiners may already be able to elect a master and immediately start recovery. Because the cluster's data is still incomplete at that point, the master instructs nodes to start replicating data to one another. When the late nodes finally join and find that their local data has already been copied elsewhere, they simply delete the now "stale" local copies. Once the whole cluster is back up, the data distribution is uneven, so the master triggers a rebalance and moves data between nodes yet again. The whole process needlessly consumes a lot of network traffic. Setting the recovery-related parameters appropriately prevents this from happening.
gateway.expected_nodes
gateway.expected_master_nodes
gateway.expected_data_nodes
These three settings mean: start the recovery process as soon as the expected number of nodes has joined the cluster. The difference is that the first counts any node, whether master or data, while the latter two count master-eligible nodes and data nodes respectively.
If the expected node count has not yet been reached, the recovery process waits for the period specified by gateway.recover_after_time (default 5 minutes). Once this wait times out, recovery starts as soon as the following conditions are met:
gateway.recover_after_nodes
gateway.recover_after_master_nodes
gateway.recover_after_data_nodes
For example, for a cluster with 10 data nodes, with the following settings:
gateway.expected_data_nodes: 10
gateway.recover_after_time: 5m
gateway.recover_after_data_nodes: 8
recovery starts immediately as soon as 10 data nodes have joined, or after the 5-minute wait if at least 8 data nodes have joined by then.
Reducing data copying between primaries and replicas
If you restart a single data node rather than the whole cluster, data will still be copied back and forth between nodes. To avoid this, disable shard allocation for the cluster before the restart:
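A sketch of the call, assuming the node is reachable on localhost:9200 and the cluster supports the cluster.routing.allocation.enable setting (very old releases used cluster.routing.allocation.disable_allocation instead):

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'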
Then, after the node has restarted and rejoined the cluster, re-enable it:
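For example, using the same setting as above:

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.enable": "all"
  }
}'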
This way, once the node is back, it recovers as much data as possible directly from its local copy.
However, before ES 1.6, even with these measures you would still see a large amount of data being copied between primaries and replicas. On the face of it this is puzzling: the primary and the replica hold exactly the same data, so ES should be able to restore the replica entirely from its local copy. Why copy it again from the primary?
The reason is that recovery simply compares the segment files of the primary and the replica to decide which data is consistent and can be recovered locally, and which is inconsistent and must be copied over the network. Segment merging runs completely independently on each node, so the primary and the replica may have merged to different depths; even when the document sets are identical, the resulting segment files are not.
To solve this problem, ES 1.6 introduced synced flush. A shard that has not been updated for 5 minutes is automatically synced-flushed, which essentially writes a sync id into the shard. When a node is restarted, the sync ids of a shard's copies can be compared first to see whether the two copies are identical, avoiding unnecessary segment file copies and greatly speeding up recovery of cold indexes.
Note that synced flush only helps cold indexes; it does nothing for hot indexes (indexes updated within the last 5 minutes). If the restarted node holds hot indexes, a large amount of file copying is unavoidable.
Therefore, before restarting a node it is best to follow the steps below; recovery can then complete almost instantly (a command sketch follows the list):
Pause data writes
Disable cluster shard allocation
Manually run POST /_flush/synced
Restart the node
Re-enable cluster shard allocation
Wait for recovery to complete and the cluster health status to turn green
Resume data writes
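A command-level sketch of the middle steps (illustrative only, again assuming localhost:9200; pausing and resuming writes happens on the application side):

# disable shard allocation before the restart
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'
# trigger a synced flush on all indexes
curl -XPOST 'http://localhost:9200/_flush/synced'
# ... restart the node ...
# re-enable shard allocation once the node has rejoined
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'
# wait until the cluster health turns green
curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=10m&pretty'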
Why recovery of very large hot indexes is slow
For cold indexes, since the data is no longer being updated, the synced flush feature lets recovery proceed quickly and directly from local data. For hot indexes, especially those with large shards, not only is synced flush useless, forcing large numbers of segment files to be copied across nodes, but translog recovery is an even more important cause of slowness.
Recovering data from a primary shard to a replica goes through three phases:
Phase 1: take a snapshot of the segment files on the primary and copy them to the node where the replica is being allocated. Indexing requests are not blocked during the copy; new index operations are recorded in the translog.
Phase 2: take a snapshot of the translog, which contains the index requests that arrived during phase 1, and replay those operations on the replica. Indexing requests are still not blocked, and new operations keep accumulating in the translog.
Phase 3: to bring the primary and the replica fully in sync, block new indexing requests and replay the translog operations that arrived during phase 2.
It follows that the translog cannot be cleared until recovery completes (background flush operations, which would normally truncate it, are disabled during this period).
If the shard is large, phase 1 takes a long time and the translog generated during it grows large. Replaying the translog is much slower than simply copying files, so phase 2 takes significantly longer as well.
By phase 3, the translog to be replayed may be even larger than in phase 2. Phase 3 blocks new indexing, which badly hurts the user experience when low write latency is required.
Therefore, the best way to speed up recovery of a large hot index is to follow the procedure from the previous section: pause new writes, manually trigger a synced flush, wait for recovery to complete, then resume writes. This minimizes the impact on data latency.
What if recovery is slow and you want to know its progress? The CAT Recovery API shows the status of each recovery phase in detail. Its full usage is not covered here; see the CAT Recovery documentation.
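For a quick look, the simplest invocation is (assuming localhost:9200):

curl -XGET 'http://localhost:9200/_cat/recovery?v'

Each line shows one shard's recovery: its type, stage, source and target nodes, and how many files and bytes have been transferred so far.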
Other recovery-related expert settings
There are other expert settings (see the recovery documentation) that affect recovery speed, but faster recovery costs more resources. Adjust these parameters on a production cluster carefully, based on the actual situation, and roll a change back as soon as it starts to affect the application.
For workloads with high search concurrency and strict latency requirements, it is generally best to leave the defaults alone.
For real-time log analysis workloads, search latency requirements are relaxed but write latency is expected to stay low; here it is reasonable to increase indices.recovery.max_bytes_per_sec to speed up recovery and shorten the time writes are blocked.
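For example, the limit can be raised dynamically through the cluster settings API (the 100mb value below is purely illustrative; pick one that matches your disks and network):

curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}'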
Finally, ES versions iterate quickly and the recovery mechanism is continually being optimized. Some versions have even introduced bugs, such as a serious translog recovery bug in ES 1.4.x that made translog recovery of large indexes almost impossible to complete.
So if you run into problems in practice, it is best to search the GitHub issue list first to see whether anyone else has reported the same problem with the version you are using.