2025-02-24 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
One: manual failover
Redis clusters support manual failover: sending a "CLUSTER FAILOVER" command to a slave node initiates a failover while the master is still online; the slave is promoted to the new master and the original master is demoted to a slave.
So that no data is lost, the process after the slave receives the "CLUSTER FAILOVER" command is as follows:
A: on receiving the command, the slave sends a CLUSTERMSG_TYPE_MFSTART packet to its master.
B: on receiving that packet, the master pauses all of its clients, that is, for the next 10 seconds it stops processing client commands, and the heartbeat packets it sends carry the CLUSTERMSG_FLAG0_PAUSED flag.
C: when the slave receives a heartbeat from the master marked CLUSTERMSG_FLAG0_PAUSED, it records the master's current replication offset. It then waits until its own replication offset reaches that value before starting the failover proper: initiating the election, counting votes, winning the election, promoting itself to master, and updating the configuration.
The "CLUSTER FAILOVER" command supports two options, FORCE and TAKEOVER, which alter the process above.
With FORCE, the slave does not interact with the master and the master does not pause its clients; the slave starts the failover process immediately: initiating the election, counting votes, winning the election, promoting itself to master, and updating the configuration.
With TAKEOVER things are cruder still: the slave skips the election entirely, promotes itself to master directly, takes over the slots of its old master, bumps its own configEpoch, and updates the configuration.
Therefore, with FORCE or TAKEOVER the master may be offline; a bare "CLUSTER FAILOVER" with no option requires the master to be online.
In the clusterCommand function, the part of the code that handles the "CLUSTER FAILOVER" command is as follows:
else if (!strcasecmp(c->argv[1]->ptr,"failover") &&
           (c->argc == 2 || c->argc == 3))
{
    /* CLUSTER FAILOVER [FORCE|TAKEOVER] */
    int force = 0, takeover = 0;

    if (c->argc == 3) {
        if (!strcasecmp(c->argv[2]->ptr,"force")) {
            force = 1;
        } else if (!strcasecmp(c->argv[2]->ptr,"takeover")) {
            takeover = 1;
            force = 1; /* Takeover also implies force. */
        } else {
            addReply(c,shared.syntaxerr);
            return;
        }
    }

    /* Check preconditions. */
    if (nodeIsMaster(myself)) {
        addReplyError(c,"You should send CLUSTER FAILOVER to a slave");
        return;
    } else if (myself->slaveof == NULL) {
        addReplyError(c,"I'm a slave but my master is unknown to me");
        return;
    } else if (!force &&
               (nodeFailed(myself->slaveof) ||
                myself->slaveof->link == NULL))
    {
        addReplyError(c,"Master is down or failed, "
                        "please use CLUSTER FAILOVER FORCE");
        return;
    }
    resetManualFailover();
    server.cluster->mf_end = mstime() + REDIS_CLUSTER_MF_TIMEOUT;

    if (takeover) {
        /* A takeover does not perform any initial check. It just
         * generates a new configuration epoch for this node without
         * consensus, claims the master's slots, and broadcast the new
         * configuration. */
        redisLog(REDIS_WARNING,"Taking over the master (user request).");
        clusterBumpConfigEpochWithoutConsensus();
        clusterFailoverReplaceYourMaster();
    } else if (force) {
        /* If this is a forced failover, we don't need to talk with our
         * master to agree about the offset. We just failover taking over
         * it without coordination. */
        redisLog(REDIS_WARNING,"Forced failover user request accepted.");
        server.cluster->mf_can_start = 1;
    } else {
        redisLog(REDIS_WARNING,"Manual failover user request accepted.");
        clusterSendMFStart(myself->slaveof);
    }
    addReply(c,shared.ok);
}
First, check whether the last argument of the command is FORCE or TAKEOVER.
If the current node is a master; or it is a slave but has no known master; or its master is failed or disconnected and neither FORCE nor TAKEOVER was given, then an error is replied to the client and the function returns.
Then resetManualFailover is called to reset any manual failover state.
mf_end is set to the current time plus 5 seconds (REDIS_CLUSTER_MF_TIMEOUT). This attribute is the deadline of the manual failover and also indicates whether a manual failover is currently in progress.
If the last argument is TAKEOVER, the slave receiving the command takes over its master's slots and becomes the new master directly, without any election. It first calls clusterBumpConfigEpochWithoutConsensus to generate a new configEpoch so the configuration can be updated, then calls clusterFailoverReplaceYourMaster to turn itself into the new master and broadcast the change to all nodes in the cluster.
If the last argument is FORCE, the slave may start the election process directly, without first catching up to the master's replication offset. mf_can_start is therefore set to 1 so that in clusterHandleSlaveFailover the failover can start even though the master is not failed and the slave's replication data may be stale.
If neither FORCE nor TAKEOVER was given, the slave must first notify its master, so clusterSendMFStart is called to send a CLUSTERMSG_TYPE_MFSTART packet to the master.
When the master receives the CLUSTERMSG_TYPE_MFSTART packet, it handles it in the clusterProcessPacket function as follows:
else if (type == CLUSTERMSG_TYPE_MFSTART) {
    /* This message is acceptable only if I'm a master and the sender
     * is one of my slaves. */
    if (!sender || sender->slaveof != myself) return 1;
    /* Manual failover requested from slaves. Initialize the state
     * accordingly. */
    resetManualFailover();
    server.cluster->mf_end = mstime() + REDIS_CLUSTER_MF_TIMEOUT;
    server.cluster->mf_slave = sender;
    pauseClients(mstime()+(REDIS_CLUSTER_MF_TIMEOUT*2));
    redisLog(REDIS_WARNING,"Manual failover requested by slave %.40s.",
        sender->name);
}
If the sender cannot be found in the nodes dictionary, or the sender's master is not the current node, return immediately.
Otherwise, call resetManualFailover to reset any manual failover state.
Then set mf_end to the current time plus 5 seconds (REDIS_CLUSTER_MF_TIMEOUT). As on the slave side, this attribute is the deadline of the manual failover and indicates that one is in progress.
Then set mf_slave to sender, the slave performing the manual failover.
Finally, call pauseClients to pause all clients for the next 10 seconds (twice REDIS_CLUSTER_MF_TIMEOUT).
When the master builds the header of an outgoing heartbeat and finds that a manual failover is in progress, it sets the CLUSTERMSG_FLAG0_PAUSED flag in the header:
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    ...
    /* Set the message flags. */
    if (nodeIsMaster(myself) && server.cluster->mf_end)
        hdr->mflags[0] |= CLUSTERMSG_FLAG0_PAUSED;
    ...
}
The slave processes incoming packets in clusterProcessPacket. As soon as it sees a packet from its master carrying the CLUSTERMSG_FLAG0_PAUSED flag, it records the master's replication offset in server.cluster->mf_master_offset:
int clusterProcessPacket(clusterLink *link) {
    ...
    /* Check if the sender is a known node. */
    sender = clusterLookupNode(hdr->sender);
    if (sender && !nodeInHandshake(sender)) {
        ...
        /* Update the replication offset info for this node. */
        sender->repl_offset = ntohu64(hdr->offset);
        sender->repl_offset_time = mstime();
        /* If we are a slave performing a manual failover and our master
         * sent its offset while already paused, populate the MF state. */
        if (server.cluster->mf_end &&
            nodeIsSlave(myself) &&
            myself->slaveof == sender &&
            hdr->mflags[0] & CLUSTERMSG_FLAG0_PAUSED &&
            server.cluster->mf_master_offset == 0)
        {
            server.cluster->mf_master_offset = sender->repl_offset;
            redisLog(REDIS_WARNING,
                "Received replication offset for paused "
                "master manual failover: %lld",
                server.cluster->mf_master_offset);
        }
    }
    ...
}
On the slave, the cluster timer function clusterCron calls clusterHandleManualFailover. Once the slave's replication offset reaches server.cluster->mf_master_offset, server.cluster->mf_can_start is set to 1, so the failover process starts at the next call to clusterHandleSlaveFailover.
The code for the clusterHandleManualFailover function is as follows:
void clusterHandleManualFailover(void) {
    /* Return ASAP if no manual failover is in progress. */
    if (server.cluster->mf_end == 0) return;

    /* If mf_can_start is non-zero, the failover was already triggered so the
     * next steps are performed by clusterHandleSlaveFailover(). */
    if (server.cluster->mf_can_start) return;

    if (server.cluster->mf_master_offset == 0) return; /* Wait for offset... */

    if (server.cluster->mf_master_offset == replicationGetSlaveOffset()) {
        /* Our replication offset matches the master replication offset
         * announced after clients were paused. We can start the failover. */
        server.cluster->mf_can_start = 1;
        redisLog(REDIS_WARNING,
            "All master replication stream processed, "
            "manual failover can start.");
    }
}
On both master and slave, the cluster timer function clusterCron calls manualFailoverCheckTimeout. Once the manual failover deadline has passed, the manual failover state is reset, aborting the process. The manualFailoverCheckTimeout code is as follows:
/* If a manual failover timed out, abort it. */
void manualFailoverCheckTimeout(void) {
    if (server.cluster->mf_end && server.cluster->mf_end < mstime()) {
        redisLog(REDIS_WARNING,"Manual failover timed out.");
        resetManualFailover();
    }
}

Two: slave migration

In a Redis cluster, to improve availability, each master is normally given several slaves. But if these master-slave relationships were fixed, then after a while orphaned masters could appear: a master left with no slave usable for failover. Once such a master goes offline, the whole cluster becomes unavailable.

Redis clusters therefore add slave migration. Briefly: once an orphaned master is detected in the cluster, some slave A automatically becomes a slave of that orphaned master. Slave A is chosen so that its master has the largest number of attached slaves, and among those slaves A has the smallest node ID ("The acting slave is the slave among the masters with the maximum number of attached slaves, that is not in FAIL state and has the smallest node ID").

This feature is implemented in the cluster timer function clusterCron. The relevant code is as follows:

void clusterCron(void) {
    ...
    orphaned_masters = 0;
    max_slaves = 0;
    this_slaves = 0;
    di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        now = mstime(); /* Use an updated time at every iteration. */
        mstime_t delay;

        if (node->flags &
            (REDIS_NODE_MYSELF|REDIS_NODE_NOADDR|REDIS_NODE_HANDSHAKE))
            continue;

        /* Orphaned master check, useful only if the current instance
         * is a slave that may migrate to another master. */
        if (nodeIsSlave(myself) && nodeIsMaster(node) && !nodeFailed(node)) {
            int okslaves = clusterCountNonFailingSlaves(node);

            /* A master is orphaned if it is serving a non-zero number of
             * slots, have no working slaves, but used to have at least one
             * slave. */
            if (okslaves == 0 && node->numslots > 0 && node->numslaves)
                orphaned_masters++;
            if (okslaves > max_slaves) max_slaves = okslaves;
            if (nodeIsSlave(myself) && myself->slaveof == node)
                this_slaves = okslaves;
        }
        ...
    }
    ...
    if (nodeIsSlave(myself)) {
        ...
        /* If there are orphaned slaves, and we are a slave among the masters
         * with the max number of non-failing slaves, consider migrating to
         * the orphaned masters. Note that it does not make sense to try
         * a migration if there is no master with at least *two* working
         * slaves. */
        if (orphaned_masters && max_slaves >= 2 && this_slaves == max_slaves)
            clusterHandleSlaveMigration(max_slaves);
    }
    ...
}
The loop iterates over the dictionary server.cluster->nodes. Every node other than the current node that has a known address and is not in handshake state is processed as follows:
If the current node is a slave and node is a master that is not flagged as failed, clusterCountNonFailingSlaves is called first to count node's non-failing slaves into okslaves. If okslaves is 0 while node serves a non-zero number of slots and once had at least one slave, node is an orphaned master, so orphaned_masters is incremented. If okslaves exceeds max_slaves, max_slaves is updated, so max_slaves records the highest count of working slaves any master has. If the current node happens to be one of node's slaves, okslaves is recorded in this_slaves. All of this is preparation for the migration decision.
After the loop, if there is at least one orphaned master, max_slaves is at least 2, and the current node is a slave of the master with the most working slaves, clusterHandleSlaveMigration is called; if its own conditions are met, the migration is performed, i.e. the current slave becomes a slave of an orphaned master.
The code for the clusterHandleSlaveMigration function is as follows:
void clusterHandleSlaveMigration(int max_slaves) {
    int j, okslaves = 0;
    clusterNode *mymaster = myself->slaveof, *target = NULL, *candidate = NULL;
    dictIterator *di;
    dictEntry *de;

    /* Step 1: Don't migrate if the cluster state is not ok. */
    if (server.cluster->state != REDIS_CLUSTER_OK) return;

    /* Step 2: Don't migrate if my master will not be left with at least
     * 'migration-barrier' slaves after my migration. */
    if (mymaster == NULL) return;
    for (j = 0; j < mymaster->numslaves; j++)
        if (!nodeFailed(mymaster->slaves[j]) &&
            !nodeTimedOut(mymaster->slaves[j])) okslaves++;
    if (okslaves <= server.cluster_migration_barrier) return;

    /* Step 3: Identify a candidate for migration, and check if among the
     * masters with the greatest number of ok slaves, I'm the one with the
     * smallest node ID. */
    candidate = myself;
    di = dictGetSafeIterator(server.cluster->nodes);
    while((de = dictNext(di)) != NULL) {
        clusterNode *node = dictGetVal(de);
        int okslaves;

        /* Only iterate over working masters that were configured with
         * at least one slave. */
        if (nodeIsSlave(node) || nodeFailed(node)) continue;
        if (node->numslaves == 0) continue;
        okslaves = clusterCountNonFailingSlaves(node);

        if (okslaves == 0 && target == NULL && node->numslots > 0)
            target = node;

        if (okslaves == max_slaves) {
            for (j = 0; j < node->numslaves; j++) {
                if (memcmp(node->slaves[j]->name,
                           candidate->name,
                           REDIS_CLUSTER_NAMELEN) < 0)
                {
                    candidate = node->slaves[j];
                }
            }
        }
    }
    dictReleaseIterator(di);

    /* Step 4: perform the migration if there is a target, and if I'm the
     * candidate. */
    if (target && candidate == myself) {
        redisLog(REDIS_WARNING,"Migrating to orphaned master %.40s",
            target->name);
        clusterSetMaster(target);
    }
}
If the cluster state is not REDIS_CLUSTER_OK, return immediately; likewise if the current slave has no master.
Next, count the master's non-failing, non-timed-out slaves into okslaves; if okslaves is less than or equal to the migration barrier server.cluster_migration_barrier, return immediately.
Then iterate over the dictionary server.cluster->nodes; for each node:
If node is a slave or is failed, skip to the next node; likewise if node has never had any slaves configured.
Call clusterCountNonFailingSlaves to count node's non-failing slaves into okslaves. If okslaves is 0 and node->numslots is greater than 0, this master once had slaves but they are all down, so an orphaned master has been found; record it in target.
If okslaves equals the max_slaves parameter, node is one of the masters with the most working slaves, so compare the current candidate's ID with the IDs of all of node's slaves, and keep whichever has the smaller name as candidate. (In fact, as soon as the current node is found not to be the smallest, the function could return early here.)
After the loop, if an orphaned master was found and the current node is the candidate with the smallest node ID, call clusterSetMaster to make target the current node's master and start the master-slave replication process.
Three: configEpoch conflicts
Within a cluster, it is harmless for masters serving different slots to share the same configEpoch. What must never happen is two masters with the same configEpoch both claiming the same slot, whether through human intervention or a bug, since that is fatal in a distributed system. Redis therefore stipulates that every node in the cluster must have a distinct configEpoch.
When a slave is promoted to a new master, it obtains a configEpoch greater than that of every current node, so elections alone cannot produce duplicates (no two slaves can win the same election at the same time). However, at the end of an administrator-initiated resharding, the node importing slots bumps its own configEpoch without the agreement of the other nodes, and a manual forced failover does the same; either can leave multiple masters with the same configEpoch.
An algorithm is therefore needed to guarantee that all nodes in the cluster end up with distinct configEpochs. It works as follows: when a master receives a heartbeat from another master and finds that the configEpoch in the packet equals its own, it calls the clusterHandleConfigEpochCollision function to resolve the conflict.
The code for the clusterHandleConfigEpochCollision function is as follows:
void clusterHandleConfigEpochCollision(clusterNode *sender) {
    /* Prerequisites: nodes have the same configEpoch and are both masters. */
    if (sender->configEpoch != myself->configEpoch ||
        !nodeIsMaster(sender) || !nodeIsMaster(myself)) return;
    /* Don't act if the colliding node has a smaller Node ID. */
    if (memcmp(sender->name,myself->name,REDIS_CLUSTER_NAMELEN) <= 0) return;
    /* Get the next ID available at the best of this node knowledge. */
    server.cluster->currentEpoch++;
    myself->configEpoch = server.cluster->currentEpoch;
    clusterSaveConfigOrDie(1);
    redisLog(REDIS_VERBOSE,
        "WARNING: configEpoch collision with node %.40s."
        " configEpoch set to %llu",
        sender->name,
        (unsigned long long) myself->configEpoch);
}
If the sender's configEpoch differs from the current node's, or the sender is not a master, or the current node is not a master, return immediately.
If the sender's node ID is smaller than (or equal to) the current node's, return immediately.
Thus only the node with the smaller name acts, and it gets the larger configEpoch: it increments its own currentEpoch and then assigns the new currentEpoch to its configEpoch.
In this way, even if several nodes share a configEpoch, only the node with the largest node ID keeps it; every other node bumps its configEpoch to a distinct value, and the node with the smallest node ID ends up with the largest configEpoch.
Summary
The above is this article's Redis source code analysis of the cluster: manual failover, slave migration, and configEpoch conflict resolution. Where it falls short, please leave a comment; thanks for your support!
© 2024 shulou.com SLNews company. All rights reserved.