This article looks at the impact of shard relocation (relocating) on an Elasticsearch cluster: how it loads the master and how it affects the write path.
The shard-started RPC takes up master processing time
After a shard move completes, the target node sends a shard-started RPC to the master. This RPC has the second-highest priority, URGENT. The master can be slow to process it, especially when the cluster holds tens of thousands of shards, and lower-priority RPCs such as put-mapping may then be delayed for a long time, which has a noticeable impact on the business.
Shard-started RPCs produced by rebalance, allocation filtering, forced awareness and other triggers all cause this problem. For example, when scaling out a cluster, raising the rebalance concurrency noticeably eats into the master's processing capacity. For clusters with a large number of shards, therefore, consider the impact on the master before increasing the concurrency of rebalance and recovery.
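For reference, this concurrency is governed by dynamic cluster settings. A minimal sketch of adjusting them follows (the values shown are examples, not recommendations):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

Raising these values makes relocation finish sooner, but, as described above, it also increases the number of shard-started RPCs the master has to handle at the same time.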
Moving a primary shard: the impact on the write process
When a primary shard is relocated, whether by rebalancing or by a manual move, there is inevitably a period during which the primary cannot accept writes.
Elasticsearch relocates a primary shard by moving it directly; it does not first hand the primary role over to a replica and then move the shard. Even if it did, there would still be a moment during the switch when writes could not be served.
In terms of RPC timing, as shown in the following figure, the area marked in red is the period that blocks the write process. During this period, write requests from clients are blocked and do not return until this processing completes.
Take a manual move as an example. When a primary shard is moved from node-idea to node-1, the following phases are executed (a reroute sketch for issuing such a move follows the list):
First, the master publishes a new cluster state that marks the shard as RELOCATING and updates routing_table and routing_nodes.
After the data nodes receive the cluster state, the recovery process runs: it starts when the target node sends start_recovery and ends when the source node returns the start_recovery response.
After recovery completes, the target node sends the shard-started RPC to the master, and the master publishes the cluster state again, marking the shard as STARTED.
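For reference, a manual move like the one in this example can be issued through the cluster reroute API. A minimal sketch (the index name is a placeholder; the node names follow the example above):

POST /_cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "test-index",
        "shard": 0,
        "from_node": "node-idea",
        "to_node": "node-1"
      }
    }
  ]
}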
The phase that blocks client bulk writes starts when the source node executes the finalize step and prepares to send the handoff RPC, and lasts until the new cluster state from the master is applied and the shard is marked STARTED. Each node applies the cluster state at a slightly different time, so the moment at which writes stop being blocked also differs slightly from node to node.
The handoff RPC tells the target node that it can take over the primary role for the shard. Handling the handoff RPC is just a series of in-memory operations, including taking a few locks, and usually completes quickly.
The impact of the whole primary-shard relocation on writes is fairly involved, so I split it into several stages. The detailed process follows.
Detailed principle of the blocking process
During the whole relocation, requests addressed to the target node are forwarded to the source node, until the target node applies the cluster state from the master that marks the shard as STARTED.
Normal write phase
At first, write requests arrive at the source node and are applied to the primary shard as usual, until the replication stage begins.
Replication stage
This stage replicates the incoming write requests to the target node. It starts after the prepare_translog response is received and lasts until the handoff phase begins.
1. Add the target shard to the replication group
After the prepare_translog response is received, phase1 of the recovery is over: the segment files have been sent and the engine on the target node has started. The target shard is added to the replication group via shard.initiateTracking, and subsequent writes are replicated to the target node.
prepareEngineStep.whenComplete(prepareEngineTime -> {
    runUnderPrimaryPermit(() -> shard.initiateTracking(request.targetAllocationId()),
        shardId + " initiating tracking of " + request.targetAllocationId(), shard, cancellableThreads, logger);
    final Translog.Snapshot phase2Snapshot = shard.getHistoryOperations("peer-recovery", historySource, startingSeqNo);
    // ...
});
The replication group information is maintained in the replicationGroup member of ReplicationTracker.
2. During the write process, replicate the write request to the target
On the write path, the finishRequest method inside performOnPrimary#doRun() eventually leads to ReplicationOperation#handlePrimaryResult, whose performOnReplicas call sends the index operation to the target node.
Handoff blocking phase
The handoff phase completes the transfer of the primary-shard role. During this stage, incoming write requests are put into a queue and are executed only after the role transfer has finished; they can be blocked for up to 30 minutes. Writes that reach the target node during this period are likewise forwarded to and executed by the source node. The phase begins after the source node receives the finalize response and lasts until the next phase.
1. The recovery process sets the blocking flag
After the source node receives the finalize response, the primary-shard role is handed over via the handoff RPC. During the handover, write requests arriving at this time are blocked through indexShardOperationPermits.blockOperations, for at most 30 minutes.
public void relocated(final String targetAllocationId, final Consumer<ReplicationTracker.PrimaryContext> consumer) {
    try {
        indexShardOperationPermits.blockOperations(30, TimeUnit.MINUTES, () -> {
            // the blocked scope is the code inside this lambda
            final ReplicationTracker.PrimaryContext primaryContext = replicationTracker.startRelocationHandoff(targetAllocationId);
            try { // calls RemoteRecoveryTargetHandler.handoffPrimaryContext below
                consumer.accept(primaryContext); // the handoff_primary_context RPC has been sent and answered once this returns
            }
            // ...
        });
    }
    // ...
}
The blocking operation inside indexShardOperationPermits.blockOperations does nothing more than increment queuedBlockOperations by 1; the subsequent write path then checks whether queuedBlockOperations is 0.
2. Processing on the write path
The acquire call on the write path is reached through the following stack:
acquire:255, IndexShardOperationPermits (org.elasticsearch.index.shard)
acquirePrimaryOperationPermit:2764, IndexShard (org.elasticsearch.index.shard)
handlePrimaryRequest:256, TransportReplicationAction
This function finds that queuedBlockOperations is greater than 0, so it adds the operation to delayedOperations and the write path returns. The operations held in delayedOperations are processed when the blocking phase ends.
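The pattern can be illustrated with a small self-contained Java sketch. It is not the Elasticsearch implementation, only an analogy: a counter records whether a block is active, writes arriving while the counter is non-zero are parked in a delayed queue, and the parked writes run when the block is released.

import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch of the blocking pattern described above (not Elasticsearch source).
class OperationPermitsSketch {
    private int queuedBlockOperations = 0;
    private final Queue<Runnable> delayedOperations = new ArrayDeque<>();

    // Corresponds to indexShardOperationPermits.blockOperations: blocking is just
    // incrementing the counter, then running the blocked section (the handoff).
    synchronized void blockOperations(Runnable duringBlock) {
        queuedBlockOperations++;
        try {
            duringBlock.run();
        } finally {
            queuedBlockOperations--;
            if (queuedBlockOperations == 0) {
                Runnable op;
                while ((op = delayedOperations.poll()) != null) {
                    op.run(); // replay the writes that arrived during the block
                }
            }
        }
    }

    // Corresponds to acquire on the write path: if a block is active, park the write.
    synchronized void acquire(Runnable writeOperation) {
        if (queuedBlockOperations > 0) {
            delayedOperations.add(writeOperation); // the indexing thread returns immediately
        } else {
            writeOperation.run(); // normal write path
        }
    }
}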
Fail-and-retry phase
The fail-and-retry phase begins after the source node has processed the handoff response and lasts until the master publishes the cluster state marking the shard STARTED; on each data node it ends once that cluster state has been applied. Until the cluster state is applied, the shard is still in the RELOCATING state locally, so a write during this period is still executed on the source node and fails with a ShardNotInPrimaryModeException. The exception is caught, the write waits up to 1 minute and is retried, and it is given up if it fails again. If a new cluster state is received within that minute, the write is also retried and then succeeds.
On the write path, when the acquire function runs it now finds queuedBlockOperations equal to 0 and calls onAcquired.onResponse(releasable). When wrapPrimaryOperationPermitListener is invoked, it finds that the shard is no longer in primary mode and throws a ShardNotInPrimaryModeException.
if (replicationTracker.isPrimaryMode()) {
    l.onResponse(r);
} else {
    r.close();
    l.onFailure(new ShardNotInPrimaryModeException(shardId, state));
}
For the exception returned above, the listener registered by acquirePrimaryOperationPermit distinguishes the exception type: if it is a ShardNotInPrimaryModeException, it waits up to 1 minute and then retries.
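A simplified, self-contained sketch of this retry behaviour is shown below. It is illustrative only: the real code reacts to the arrival of a new cluster state through a cluster state observer with a 1-minute timeout, whereas this sketch simply retries once after a fixed delay.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch of "fail with ShardNotInPrimaryModeException, wait, retry once"
// (not Elasticsearch source; the exception class here is a local stand-in).
class RetryOnNotPrimarySketch {

    static class ShardNotInPrimaryModeException extends RuntimeException {}

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    CompletableFuture<String> writeWithRetry(Supplier<String> write) {
        CompletableFuture<String> result = new CompletableFuture<>();
        try {
            result.complete(write.get()); // normal case: the shard is in primary mode
        } catch (ShardNotInPrimaryModeException e) {
            // The shard is still marked RELOCATING locally. Wait (up to 1 minute in
            // Elasticsearch, a fixed delay here) and retry once, hoping the cluster state
            // that marks the shard STARTED has been applied in the meantime.
            scheduler.schedule(() -> {
                try {
                    result.complete(write.get());
                } catch (RuntimeException again) {
                    result.completeExceptionally(again); // the second failure goes back to the client
                }
            }, 60, TimeUnit.SECONDS);
        }
        return result;
    }
}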
Moving a replica shard: the impact on the write process
The move of a replica shard is similar to that of a primary shard: the cluster state is likewise published twice, and the data is copied to the target node through recovery.
Note that recovery does not copy the shard to the target from the node where the replica currently resides, but from the primary shard. For example, if the primary is on node-idea and the replica is on node-1, then when we move the replica from node-1 to node-2, recovery copies the data from node-idea to node-2. When node-1 applies the cluster state published by the master in which the shard is marked STARTED, it finds that the shard no longer belongs to it and deletes its local copy.
As with primary-shard relocation, the shard being recovered is added to the replication group during the replication phase, so the replication group temporarily contains three copies: the primary shard, the original replica, and the new replica being moved to the target node. During this phase, write operations are written to all three copies.
Take a manual move of a replica as an example, from node-1 to node-2. When the first cluster state is received, the shard is marked RELOCATING, so the data is recovered from node-idea to node-2. During this time, new index operations are replicated from the primary node to the other two nodes, as shown in the following figure.
This process is equivalent to first adding a replica and running the replica recovery process to copy the data to the new target.
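The temporary three-copy replication group can be pictured with a small self-contained Java sketch (illustrative only, not Elasticsearch source; the node names follow the example above):

import java.util.List;

// Illustrative sketch: during a replica move the replication group briefly holds the
// primary, the original replica and the relocation target, and every write is
// replicated to all non-primary copies.
class ReplicationFanoutSketch {

    record ShardCopy(String nodeName, boolean isPrimary) {}

    static void performOnReplicas(String operation, List<ShardCopy> replicationGroup) {
        for (ShardCopy copy : replicationGroup) {
            if (copy.isPrimary()) {
                continue; // the primary has already executed the operation locally
            }
            System.out.println("replicating [" + operation + "] to the copy on " + copy.nodeName());
        }
    }

    public static void main(String[] args) {
        List<ShardCopy> group = List.of(
            new ShardCopy("node-idea", true),  // primary
            new ShardCopy("node-1", false),    // original replica, deleted once the move completes
            new ShardCopy("node-2", false));   // relocation target, added when tracking starts
        performOnReplicas("index request", group);
    }
}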
When the recovery completes, the master publishes the second cluster state, the shard is marked STARTED, and the copy originally held by node-1 is deleted, as shown in the following figure.
There is no handoff phase in replica relocation; the whole process is the same as an ordinary replica recovery, and there is no stage that blocks writes.
At this point, you should have a clearer picture of the impact of relocation on an Elasticsearch cluster; it is worth verifying these behaviours in practice.