How to understand Redis Cluster Gossip protocol 07/06 Update SLTechnology News&Howtos


2025-07-06 Update From: SLTechnology News&Howtos shulou

Shulou(Shulou.com)06/03 Report--

This article explains the Redis Cluster Gossip protocol: why clusters need it, how Redis Cluster nodes use it, and how the relevant parts of the Redis source code implement it.

Introduction to Cluster Mode and Gossip

In data storage, once the data volume or request traffic grows large enough, a distributed design becomes unavoidable. Redis is a good example: although a single instance performs very well, it still needs clustering for the following reasons:

- High availability: a single machine cannot guarantee availability, so multiple instances are needed.
- Throughput: a single instance handles roughly 80,000 QPS; higher QPS requires multiple instances.
- Capacity: a single machine can hold only a limited amount of data; storing more requires multiple instances.
- Network bandwidth: when traffic exceeds what one server's network card can carry, multiple instances are needed to spread it.

A cluster has to maintain metadata, such as the IP addresses of its instances and which slots (data shards) each instance owns, so it needs a distributed mechanism to keep that metadata consistent. Such mechanisms generally follow one of two models: decentralized or centralized.

In the decentralized model, metadata is stored on some or all of the nodes, and the nodes communicate continuously to propagate changes and keep the metadata consistent. Redis Cluster and Consul, among others, use this model.

In the centralized model, cluster metadata is stored in one place, on external nodes or middleware such as ZooKeeper. Older versions of Kafka, Storm, and others use this model.

Each of the two modes has its advantages and disadvantages, as shown in the following table:

Mode | Advantages | Disadvantages
Centralized | Timely: as soon as metadata changes, it is written to the central external node, and other nodes see the change the next time they read it. | All update pressure is concentrated on the external node, a single point that can affect the whole system.
Decentralized | Update pressure is spread out: metadata updates are not concentrated on one node but handled by different nodes, which reduces concurrent load. | Updates propagate with some delay, so the cluster may perceive changes with a lag.

In the decentralized model there are several algorithms for synchronizing metadata, such as Paxos, Raft, and Gossip. Paxos and Raft require a majority (more than half) of the nodes to be running for the cluster to operate stably, while Gossip has no such majority requirement.

The Gossip protocol, as its name suggests, spreads information through the whole network the way gossip spreads among people: randomly and contagiously, so that the data on all nodes becomes consistent after some time. Learning it gives you a solid understanding of the most widely used approach to eventual consistency, and a practical tool for achieving eventual consistency in your own systems.

The Gossip protocol, also known as the epidemic protocol, exchanges information between nodes or processes the way an epidemic spreads. It is widely used in P2P networks and distributed systems, and its core idea is very simple:

In a cluster on a bounded network, if every node repeatedly exchanges certain information with randomly chosen peers, then after enough time every node's view of the cluster converges to the same state.

The "certain information" here is usually the cluster state, the state of each node, and other metadata. Gossip fits the BASE model (basically available, soft state, eventually consistent) and can be used wherever eventual consistency is acceptable, such as distributed storage and service registries. It also makes clusters elastic: nodes can join or leave at any time, failures are detected quickly, and load can be rebalanced dynamically.

The biggest advantage of the Gossip protocol is that the load on each node stays almost constant as the number of nodes grows. This is what allows Redis Cluster and Consul clusters to scale out to a large number of nodes.
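To make the convergence claim concrete, here is a small simulation of push-style gossip. It is a simplified model, not Redis's actual PING scheduling: rounds are synchronous, every node that already knows the news tells one peer chosen uniformly at random per round, and no node fails. The function name gossip_rounds and the xorshift32 generator are illustrative choices made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_NODES 1024

/* Tiny deterministic PRNG (xorshift32) so runs are reproducible. */
static uint32_t rng_next(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Push gossip: in every round, each node that knew the rumor at the start
 * of the round tells one peer chosen uniformly at random.
 * Returns the number of rounds until all n nodes know it, or -1 if
 * max_rounds is reached first. seed must be non-zero. */
int gossip_rounds(int n, uint32_t seed, int max_rounds) {
    bool knows[MAX_NODES] = {false};
    bool next[MAX_NODES];
    assert(n > 1 && n <= MAX_NODES && seed != 0);

    knows[0] = true;   /* node 0 starts with the new information */
    int informed = 1;
    for (int round = 1; round <= max_rounds; round++) {
        memcpy(next, knows, sizeof(bool) * (size_t)n);
        for (int i = 0; i < n; i++) {
            if (!knows[i]) continue;
            int peer = (int)(rng_next(&seed) % (uint32_t)n);
            if (peer != i && !next[peer]) {
                next[peer] = true;
                informed++;
            }
        }
        memcpy(knows, next, sizeof(bool) * (size_t)n);
        if (informed == n) return round;
    }
    return -1;
}
```

Because the set of informed nodes can at most double per round, gossip_rounds(1000, 12345u, 200) can never return fewer than 10; in practice the count stays on the order of log n, which is the sense in which gossip converges quickly.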

Gossip Communication Mechanism of Redis Cluster

Redis added cluster support in version 3.0. So that every instance in the cluster knows the state of all the other instances, Redis Cluster has the instances exchange information using the Gossip protocol.

The figure above shows a Redis Cluster diagram of the master-slave architecture, where the solid line represents the master-slave replication relationship between the nodes, and the dotted line represents the Gossip communication between the nodes.

Each node in the Redis Cluster maintains a copy of the current status of the entire cluster from its own perspective, including:

- the current state of the cluster;
- the slots each node is responsible for, and their migration state;
- the master/replica role of each node;
- whether each node is alive, suspected failed (PFAIL), or failed (FAIL).

In other words, this information is what the nodes gossip about. Each node shares both its own state and what it knows about the others, so as the information is passed around, every node's view becomes complete and consistent.

Redis Cluster nodes exchange several kinds of messages; the most important are:

- MEET: when a client runs "CLUSTER MEET ip port" on a node already in the cluster, that node sends a MEET message inviting the new node to join; the new node then starts communicating with the other nodes.
- PING: at configured intervals, each node sends PING messages to other nodes in the cluster, carrying its own state, the cluster metadata it maintains, and metadata about some other nodes.
- PONG: the reply to PING and MEET. It has the same structure as a PING and likewise carries the sender's own state and other information, so it can also be used to broadcast updates.
- FAIL: once a node concludes that another node is down, it broadcasts a FAIL message about it to all nodes in the cluster, and the receivers mark that node as offline.

All message types are defined in cluster.h in the Redis source; the code shown here is from Redis 4.0.

// Note that PING, PONG and MEET are actually the same kind of message.
// PONG is the reply to PING; its format is also that of a PING message.
// MEET is a special PING that forces the receiver to add the sender to the
// cluster (if the sender is not already in its node list).
#define CLUSTERMSG_TYPE_PING 0                  /* Ping */
#define CLUSTERMSG_TYPE_PONG 1                  /* Pong, the reply to Ping */
#define CLUSTERMSG_TYPE_MEET 2                  /* Meet: request to add a node to the cluster */
#define CLUSTERMSG_TYPE_FAIL 3                  /* Fail: mark a node as FAIL */
#define CLUSTERMSG_TYPE_PUBLISH 4               /* Broadcast a Pub/Sub PUBLISH message */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5 /* Failover request: ask the receiver to vote for the sender */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6     /* The receiver votes for the sender */
#define CLUSTERMSG_TYPE_UPDATE 7                /* Slot configuration changed: the receiver should update it */
#define CLUSTERMSG_TYPE_MFSTART 8               /* Pause clients for manual failover */
#define CLUSTERMSG_TYPE_COUNT 9                 /* Total number of message types */
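Because PING, PONG and MEET share one body and differ only in the type code and in how the receiver reacts, a receiver can dispatch on the type. The sketch below repeats the type codes listed above and adds three illustrative helpers; msg_has_gossip_body, msg_reply_type and msg_type_name are hypothetical names, not Redis functions. The reply rule follows the description earlier: PING and MEET are answered with PONG.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Type codes as listed above (Redis 4.0 cluster.h). */
#define CLUSTERMSG_TYPE_PING 0
#define CLUSTERMSG_TYPE_PONG 1
#define CLUSTERMSG_TYPE_MEET 2
#define CLUSTERMSG_TYPE_FAIL 3
#define CLUSTERMSG_TYPE_PUBLISH 4
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6
#define CLUSTERMSG_TYPE_UPDATE 7
#define CLUSTERMSG_TYPE_MFSTART 8
#define CLUSTERMSG_TYPE_COUNT 9

/* PING, PONG and MEET all carry the gossip section in their body. */
bool msg_has_gossip_body(int type) {
    return type == CLUSTERMSG_TYPE_PING ||
           type == CLUSTERMSG_TYPE_PONG ||
           type == CLUSTERMSG_TYPE_MEET;
}

/* PING and MEET are answered with PONG; other types get no PONG (-1). */
int msg_reply_type(int type) {
    if (type == CLUSTERMSG_TYPE_PING || type == CLUSTERMSG_TYPE_MEET)
        return CLUSTERMSG_TYPE_PONG;
    return -1;
}

/* Human-readable name for logging; out-of-range codes map to "UNKNOWN". */
const char *msg_type_name(int type) {
    static const char *const names[CLUSTERMSG_TYPE_COUNT] = {
        "PING", "PONG", "MEET", "FAIL", "PUBLISH",
        "FAILOVER_AUTH_REQUEST", "FAILOVER_AUTH_ACK", "UPDATE", "MFSTART"
    };
    if (type < 0 || type >= CLUSTERMSG_TYPE_COUNT) return "UNKNOWN";
    return names[type];
}
```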

Through these messages, each instance in the cluster can get the status information of all other instances. In this way, even if there are events such as new node joining, node failure, Slot change and so on, the cluster status can be synchronized on each instance through the transmission of PING and PONG messages. Next, let's look at several common scenarios in turn.

Timed PING/PONG messages

Nodes in Redis Cluster regularly send PING messages to other nodes to exchange state information and to check each node's status: online, suspected offline (PFAIL), or offline (FAIL).

The working principle of timed PING/PONG for Redis clusters can be summarized as follows:

First, at a fixed frequency each instance picks some instances from the cluster at random and sends them PING messages, both to check whether they are online and to exchange state information. A PING message carries the sender's own state, the state of some other instances, and the slot mapping table. Second, an instance that receives a PING replies to the sender with a PONG message. A PONG has the same format as a PING and carries the same kinds of information, this time describing the replying instance.

The following figure shows the transmission of PING and PONG messages between two instances, where the first instance is the sending node and the second is the receiving node.

A new node joins the cluster

To add a new node to Redis Cluster, a client runs the CLUSTER MEET command against a node that is already in the cluster, as shown in the following figure.

When a node executes CLUSTER MEET, it first creates a clusterNode structure for the new node and adds it to the nodes dictionary of the clusterState it maintains. The relationship between clusterState and clusterNode is explained with a diagram and source code in the last section.

The node then sends a MEET message to the new node, using the IP address and port given in the CLUSTER MEET command. When the new node receives the MEET message from Node 1, it likewise creates a clusterNode structure for Node 1 and adds it to the nodes dictionary of its own clusterState.

Next, the new node replies to Node 1 with a PONG message. When Node 1 receives it, Node 1 knows that the new node has received its MEET message.

Finally, Node 1 sends a PING message to the new node. When the new node receives it, the new node knows that Node 1 has received its PONG reply, which completes the handshake for the new node.

After the MEET operation is successful, the node will send the information of the new node to other nodes in the cluster through the timed PING mechanism mentioned earlier, allowing other nodes to shake hands with the new node. Finally, after a period of time, the new node will be recognized by all the nodes in the cluster.
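The four-step handshake can be traced with a toy model in which each side tracks only two facts: whether it has added the peer to its nodes dictionary, and whether it has proof that the peer received its message. Side, MsgKind, on_receive and meet_handshake are hypothetical names for this sketch; real Redis handles these steps inside its message-processing code with many more checks.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { MSG_NONE, MSG_MEET, MSG_PING, MSG_PONG } MsgKind;

typedef struct {
    bool knows_peer;      /* peer has a clusterNode in this side's nodes dict */
    bool handshake_done;  /* this side has proof the peer received its message */
} Side;

/* Apply one received message to `self` and return the reply to send. */
MsgKind on_receive(Side *self, MsgKind in, bool we_sent_meet) {
    switch (in) {
    case MSG_MEET:                 /* step 2: add the sender, answer PONG */
        self->knows_peer = true;
        return MSG_PONG;
    case MSG_PONG:                 /* step 3: our MEET arrived */
        self->handshake_done = true;
        return we_sent_meet ? MSG_PING : MSG_NONE;
    case MSG_PING:                 /* step 4: our PONG arrived */
        if (self->knows_peer) self->handshake_done = true;
        return MSG_PONG;
    default:
        return MSG_NONE;
    }
}

/* Run the sequence described above between Node 1 and the new node. */
bool meet_handshake(Side *node1, Side *newnode) {
    node1->knows_peer = true;                /* step 1: CLUSTER MEET on Node 1 */
    MsgKind m = MSG_MEET;                    /* Node 1 -> new node */
    m = on_receive(newnode, m, false);       /* new node -> PONG */
    m = on_receive(node1, m, true);          /* Node 1 -> PING */
    (void)on_receive(newnode, m, false);     /* new node sees the PING */
    return node1->handshake_done && newnode->handshake_done;
}
```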

Suspected offline (PFAIL) and confirmed offline (FAIL)

Each node in Redis Cluster periodically checks whether the nodes it has sent PING messages to have replied with PONG within the configured time (cluster-node-timeout). If a node has not replied in time, it is marked as suspected offline, that is, in the PFAIL state, as shown in the following figure.

Through its PING messages, Node 1 then passes the information that Node 2 is suspected offline on to other nodes, such as Node 3. When Node 3 receives Node 1's PING and learns that Node 2 is in the PFAIL state, it looks up the clusterNode structure for Node 2 in the nodes dictionary of its own clusterState and adds master Node 1's failure report to that structure's fail_reports list.

Over time, suppose Node 10 (for example) also considers Node 2 suspected offline because its PONG timed out, and finds that the fail_reports list of its own clusterNode for Node 2 contains unexpired PFAIL reports from more than half of the master nodes. Node 10 then marks Node 2 as offline (FAIL) and immediately broadcasts a FAIL message about master Node 2 to the other nodes in the cluster; every node that receives it immediately marks Node 2 as offline. This is shown in the following figure.

Note that failure reports are time-limited: a report older than cluster-node-timeout * 2 is ignored. If not enough unexpired reports accumulate, Node 2 is not marked FAIL and returns to normal.
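The PFAIL-to-FAIL rule can be written as a small pure function. This sketch assumes, following the description above and our reading of Redis's failure-detection code, that the quorum is a majority of the masters (masters / 2 + 1), that a master counts its own PFAIL view as one vote, and that reports older than 2 * cluster-node-timeout are discarded. FailReport, prune_fail_reports and should_mark_fail are illustrative names, not Redis's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;   /* milliseconds, as in Redis */

typedef struct {
    int reporter_id;   /* master that reported the node as PFAIL */
    mstime_t time;     /* when the report was received or last refreshed */
} FailReport;

/* Remove reports older than 2 * node_timeout, compacting the array in
 * place. Returns the number of reports that are still valid. */
int prune_fail_reports(FailReport *r, int n, mstime_t now, mstime_t node_timeout) {
    int kept = 0;
    for (int i = 0; i < n; i++) {
        if (now - r[i].time <= node_timeout * 2) r[kept++] = r[i];
    }
    return kept;
}

/* Decide whether to promote a node from PFAIL to FAIL.
 * masters:     number of masters (assumption: those serving slots)
 * i_see_pfail: this node itself currently flags the target as PFAIL */
bool should_mark_fail(FailReport *r, int n, mstime_t now, mstime_t node_timeout,
                      int masters, bool i_am_master, bool i_see_pfail) {
    if (!i_see_pfail) return false;
    assert(masters > 0);
    int quorum = masters / 2 + 1;
    int votes = prune_fail_reports(r, n, now, node_timeout);
    if (i_am_master) votes++;   /* our own PFAIL view counts as one vote */
    return votes >= quorum;
}
```

With 5 masters the quorum is 3, so two fresh reports from other masters plus the deciding node's own view (if it is a master) are enough.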

Implementation of Redis Cluster communication source code

We have now covered how Redis Cluster handles periodic PING/PONG, new nodes joining, and nodes being suspected offline or confirmed offline. Next, let's look at how the Redis source code implements each of these steps.

The data structure involved

First, the data structures involved, namely clusterNode and the related structures mentioned above.

Each node maintains a clusterState structure that represents the overall state of the current cluster, as defined below.

typedef struct clusterState {
    clusterNode *myself;               /* clusterNode of the current node */
    /* .... */
    dict *nodes;                       /* dictionary from node name to clusterNode */
    /* .... */
    clusterNode *slots[CLUSTER_SLOTS]; /* which node owns each slot */
    /* .... */
} clusterState;

It has three key fields, as shown in the following diagram:

The myself field points to a clusterNode structure that records this node's own state; the nodes dictionary maps node names to clusterNode structures and records the state of the other nodes; and the slots array records, for each slot, the clusterNode of the node that owns it.

The clusterNode structure holds the current state of a node, such as when the node was created, its name, its current configuration epoch, and its IP address and port. In addition, its link field is a clusterLink structure holding what is needed to talk to that node, such as the socket descriptor and the input and output buffers. clusterNode also has a fail_reports list that records failure (PFAIL) reports about the node. The definition is as follows.

typedef struct clusterNode {
    mstime_t ctime;                       /* time when the node was created */
    char name[CLUSTER_NAMELEN];           /* name of the node */
    int flags;                            /* node flags: role (master/slave) and state (online, PFAIL, FAIL, ...) */
    uint64_t configEpoch;                 /* last known configuration epoch of this node */
    unsigned char slots[CLUSTER_SLOTS/8]; /* slots handled by this node */
    int numslots;                         /* number of slots handled by this node */
    int numslaves;                        /* number of slave nodes, if this is a master */
    struct clusterNode **slaves;          /* pointers to slave nodes */
    struct clusterNode *slaveof;          /* pointer to the master node. Note that it
                                             may be NULL even if the node is a slave
                                             if we don't have the master node in our
                                             tables. */
    mstime_t ping_sent;                   /* last time the current node sent a PING to this node */
    mstime_t pong_received;               /* last time the current node received a PONG from this node */
    mstime_t fail_time;                   /* time when the FAIL flag was set */
    mstime_t voted_time;                  /* last time we voted for a slave of this master */
    mstime_t repl_offset_time;            /* Unix time we received offset for this node */
    mstime_t orphaned_time;               /* starting time of orphaned master condition */
    long long repl_offset;                /* last known replication offset of this node */
    char ip[NET_IP_STR_LEN];              /* IP address of the node */
    int port;                             /* client port */
    int cport;                            /* cluster bus port, usually port + 10000 */
    clusterLink *link;                    /* TCP connection to this node */
    list *fail_reports;                   /* list of failure reports about this node */
} clusterNode;

clusterNodeFailReport records one failure report about a node: node is the node that sent the report, and time is when the report was made.

typedef struct clusterNodeFailReport {
    struct clusterNode *node; /* node that reported this node as failing */
    mstime_t time;            /* time of the report */
} clusterNodeFailReport;
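Maintaining fail_reports comes down to "add a report, or refresh its time if this reporter already has one". The sketch below models that with a plain singly linked list; the int reporter id stands in for the struct clusterNode * pointer, and add_fail_report, count_fail_reports and free_fail_reports are hypothetical names. Redis uses its own list type; the order of reports does not matter for counting, so this sketch simply prepends.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t mstime_t;

typedef struct FailReportNode {
    int reporter;                 /* stands in for struct clusterNode *node */
    mstime_t time;                /* time of the report */
    struct FailReportNode *next;
} FailReportNode;

/* Returns 1 if a new report was added, 0 if an existing report from the
 * same reporter was refreshed, -1 if memory allocation failed. */
int add_fail_report(FailReportNode **head, int reporter, mstime_t now) {
    for (FailReportNode *r = *head; r != NULL; r = r->next) {
        if (r->reporter == reporter) {
            r->time = now;        /* same reporter: just refresh the time */
            return 0;
        }
    }
    FailReportNode *r = malloc(sizeof *r);
    if (r == NULL) return -1;
    r->reporter = reporter;
    r->time = now;
    r->next = *head;              /* prepend; order does not matter here */
    *head = r;
    return 1;
}

int count_fail_reports(const FailReportNode *head) {
    int n = 0;
    for (; head != NULL; head = head->next) n++;
    return n;
}

void free_fail_reports(FailReportNode *head) {
    while (head != NULL) {
        FailReportNode *next = head->next;
        free(head);
        head = next;
    }
}
```

Refreshing instead of duplicating matters: one master that keeps reporting the same node must still count as a single vote.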

Message structure

Now that we have seen the data structures a Redis node maintains, let's look at the messages nodes use to communicate. The outermost structure is clusterMsg. It contains general message information, namely the "RCmb" signature, the total message length, the protocol version, and the message type; information about the sending node, such as its name, the slots it is responsible for, and its IP and port; and finally a clusterMsgData field that carries the type-specific message body.

typedef struct {
    char sig[4];            /* signature, "RCmb" (Redis Cluster message bus) */
    uint32_t totlen;        /* total message length */
    uint16_t ver;           /* message protocol version */
    uint16_t port;          /* base (client) port */
    uint16_t type;          /* message type */
    uint16_t count;         /* number of gossip entries (used by PING/PONG/MEET) */
    uint64_t currentEpoch;  /* cluster-wide epoch as seen by the sender, used for elections
                               and voting. Unlike configEpoch, which identifies a master,
                               currentEpoch belongs to the whole cluster. */
    uint64_t configEpoch;   /* each master has a unique configEpoch; if it conflicts with
                               another master's, it is forcibly incremented so that it is
                               unique in the cluster */
    uint64_t offset;        /* replication offset; its meaning differs for masters and slaves */
    char sender[CLUSTER_NAMELEN];           /* name of the sending node */
    unsigned char myslots[CLUSTER_SLOTS/8]; /* slots the sender is responsible for:
                                               16384/8 = 2048 bytes, one bit per slot */
    char slaveof[CLUSTER_NAMELEN];          /* if the sender is a slave, the name of its master */
    char myip[NET_IP_STR_LEN];              /* sender IP */
    char notused1[34];                      /* reserved */
    uint16_t cport;                         /* cluster bus port of the sender */
    uint16_t flags;                         /* sender's node flags, e.g. CLUSTER_NODE_HANDSHAKE, CLUSTER_NODE_MEET */
    unsigned char state;                    /* cluster state from the POV of the sender */
    unsigned char mflags[3];                /* message flags, e.g. CLUSTERMSG_FLAG0_PAUSED, CLUSTERMSG_FLAG0_FORCEACK */
    union clusterMsgData data;              /* type-specific message body */
} clusterMsg;
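The myslots field (like slots in clusterNode) is a bitmap with one bit per slot. The helpers below sketch how such a bitmap is read and written. The layout, bit (slot mod 8) of byte (slot / 8), is an assumption chosen to match what we understand Redis's own bitmap helpers to do, and SlotBitmap, slot_set, slot_clear, slot_test and slot_count are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define CLUSTER_SLOTS 16384

/* 16384 slots / 8 bits per byte = 2048 bytes. */
typedef struct {
    unsigned char bits[CLUSTER_SLOTS / 8];
} SlotBitmap;

void slot_set(SlotBitmap *m, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    m->bits[slot / 8] |= (unsigned char)(1u << (slot % 8));
}

void slot_clear(SlotBitmap *m, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    m->bits[slot / 8] &= (unsigned char)~(1u << (slot % 8));
}

int slot_test(const SlotBitmap *m, int slot) {
    assert(slot >= 0 && slot < CLUSTER_SLOTS);
    return (m->bits[slot / 8] >> (slot % 8)) & 1;
}

/* Number of slots set, comparable to clusterNode's numslots. */
int slot_count(const SlotBitmap *m) {
    int count = 0;
    for (int i = 0; i < CLUSTER_SLOTS; i++) count += slot_test(m, i);
    return count;
}
```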

clusterMsgData is a union, so it holds exactly one message body, such as that of a PING, MEET, PONG, or FAIL message. The ping field is used for PING, MEET, and PONG messages, and the fail field for FAIL messages.

// Note: this is a union
union clusterMsgData {
    /* PING, MEET and PONG messages use the ping field */
    struct {
        /* Array of N clusterMsgDataGossip structures */
        clusterMsgDataGossip gossip[1];
    } ping;

    /* FAIL messages use the fail field */
    struct {
        clusterMsgDataFail about;
    } fail;

    // .... fields for PUBLISH and UPDATE messages omitted
};

clusterMsgDataGossip is the type of each entry in the gossip section of PING, PONG, and MEET messages. Each entry describes one other node known to the sender, that is, information taken from the nodes dictionary of the sender's clusterState. As the code below shows, its fields closely resemble those of clusterNode.

typedef struct {
    /* node name. A node's name is random by default; after the MEET message has
       been sent and a reply received, the node's real name is set */
    char nodename[CLUSTER_NAMELEN];
    uint32_t ping_sent;       /* time (in seconds) the sender last sent a PING to this node;
                                 reset to 0 once the matching PONG is received */
    uint32_t pong_received;   /* time (in seconds) the sender last received a PONG from this node */
    char ip[NET_IP_STR_LEN];  /* IP address last time it was seen */
    uint16_t port;            /* base port last time it was seen */
    uint16_t cport;           /* cluster bus port last time it was seen */
    uint16_t flags;           /* copy of the node's flags */
    uint32_t notused1;        /* unused, padding */
} clusterMsgDataGossip;

typedef struct {
    char nodename[CLUSTER_NAMELEN]; /* name of the failed node */
} clusterMsgDataFail;
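The gossip[1] member is the classic "struct hack": the union declares room for one entry, but the message buffer is allocated with room for as many entries as needed, and the wire length is the header size plus count entries. The sketch uses simplified stand-in structs (GossipEntry, Msg, gossip_msg_len and alloc_gossip_msg are hypothetical, and their field sizes are not Redis's) to show the same arithmetic clusterSendPing uses for totlen.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins, not the real Redis layouts. */
typedef struct {
    char nodename[40];
    uint32_t ping_sent, pong_received;
    uint16_t port, cport, flags;
} GossipEntry;

typedef struct { char nodename[40]; } FailBody;

union MsgData {
    struct { GossipEntry gossip[1]; } ping;  /* really `count` entries */
    struct { FailBody about; } fail;
};

typedef struct {
    char sig[4];
    uint32_t totlen;
    uint16_t type, count;
    union MsgData data;   /* must stay the last member */
} Msg;

/* Same formula as in clusterSendPing: header without the union, plus one
 * GossipEntry per gossip section. */
size_t gossip_msg_len(int count) {
    return sizeof(Msg) - sizeof(union MsgData) + (size_t)count * sizeof(GossipEntry);
}

/* Zeroed buffer for `count` entries, never smaller than a full Msg, so code
 * that clears the fixed header (memset of sizeof(*hdr)) stays in bounds. */
Msg *alloc_gossip_msg(int count, size_t *len_out) {
    size_t len = gossip_msg_len(count);
    if (len < sizeof(Msg)) len = sizeof(Msg);
    if (len_out) *len_out = len;
    return calloc(1, len);
}
```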

After looking at the data structure maintained by the node and the message structure sent, let's take a look at the specific behavior source code of Redis.

Send PING messages randomly and periodically

Redis calls clusterCron periodically (10 times per second), and on every 10th call, that is, about once per second, it sends a PING to one randomly chosen node.

It samples five nodes at random, picks the one whose last PONG was received longest ago, and calls clusterSendPing to send it a CLUSTERMSG_TYPE_PING message.

// cluster.c
// clusterCron() sends a gossip PING to one random node every 10 calls (about once per second)
if (!(iteration % 10)) {
    int j;

    /* sample 5 random nodes */
    for (j = 0; j < 5; j++) {
        de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* skip nodes that are disconnected or already have a PING in flight */
        if (this->link == NULL || this->ping_sent != 0) continue;
        if (this->flags & (CLUSTER_NODE_MYSELF|CLUSTER_NODE_HANDSHAKE))
            continue;

        /* keep the node whose last PONG was received longest ago */
        if (min_pong_node == NULL || min_pong > this->pong_received) {
            min_pong_node = this;
            min_pong = this->pong_received;
        }
    }

    /* send a PING to the node that has gone longest without a PONG */
    if (min_pong_node) {
        serverLog(LL_DEBUG,"Pinging node %.40s", min_pong_node->name);
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
    }
}
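The sampling step can be isolated into a pure function so it is easy to test: the caller supplies the indexes of the randomly sampled peers (Redis draws them with dictGetRandomKey, so repeats are possible), and the function returns the eligible one with the oldest pong_received. PeerView and pick_ping_target are hypothetical names, and the booleans stand in for the real link pointer and flag checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

typedef struct {
    bool connected;            /* stands in for link != NULL */
    bool myself_or_handshake;  /* CLUSTER_NODE_MYSELF or CLUSTER_NODE_HANDSHAKE */
    mstime_t ping_sent;        /* 0 when no PING is in flight */
    mstime_t pong_received;
} PeerView;

/* Return the index of the eligible sampled peer whose last PONG is oldest,
 * or -1 if none of the sampled peers is eligible. */
int pick_ping_target(const PeerView *peers, const int *sample, int nsample) {
    int best = -1;
    for (int j = 0; j < nsample; j++) {
        const PeerView *p = &peers[sample[j]];
        if (!p->connected || p->ping_sent != 0) continue;  /* no link or PING pending */
        if (p->myself_or_handshake) continue;
        if (best == -1 || peers[best].pong_received > p->pong_received)
            best = sample[j];
    }
    return best;
}
```

Sampling five nodes instead of scanning them all keeps the per-second cost constant, while still biasing PINGs toward nodes that have been quiet the longest.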

We will look at clusterSendPing in detail later, since it is also used in several other places.

Nodes join the cluster

When a node executes CLUSTER MEET, it creates a clusterNode structure for the new node. The structure's link field (its TCP connection) is NULL, meaning no connection to the new node has been established yet.

clusterCron also handles such not-yet-connected nodes: it calls createClusterLink to create the connection and then calls clusterSendPing to send a MEET message.

/* part of clusterCron() in cluster.c: create connections for nodes that don't have one yet */
if (node->link == NULL) {
    int fd;
    mstime_t old_ping_sent;
    clusterLink *link;

    /* connect to the node's cluster bus port */
    fd = anetTcpNonBlockBindConnect(server.neterr, node->ip,
        node->cport, NET_FIRST_BIND_ADDR);
    /* ... error handling when fd is -1 ... */

    /* create the link */
    link = createClusterLink(node);
    link->fd = fd;
    node->link = link;
    aeCreateFileEvent(server.el,link->fd,AE_READABLE,
            clusterReadHandler,link);

    /* send a PING right away on the new connection so the node is not wrongly considered offline */
    /* if the node is flagged MEET, send a MEET instead of a PING */
    old_ping_sent = node->ping_sent;
    clusterSendPing(link, node->flags & CLUSTER_NODE_MEET ?
            CLUSTERMSG_TYPE_MEET : CLUSTERMSG_TYPE_PING);
    /* ... */

    /* if the sender never gets a reply to the MEET, it sends nothing more to the target node */
    /* once a reply arrives, the node leaves the HANDSHAKE state and normal PINGs are sent to it */
    node->flags &= ~CLUSTER_NODE_MEET;
}
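The bookkeeping on a fresh link can be summarized in a few lines: choose MEET or PING from the flag, keep the timestamp of any PING that was already in flight, and clear the MEET flag so it is sent at most once. Restoring old_ping_sent falls inside the part elided above; this sketch assumes the elided code restores it when it was non-zero, which is what saving old_ping_sent suggests. NodeLite, FLAG_MEET and on_link_established are hypothetical, and FLAG_MEET is not Redis's CLUSTER_NODE_MEET value.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t mstime_t;

enum { TYPE_PING = 0, TYPE_MEET = 2 };   /* same codes as CLUSTERMSG_TYPE_* */
#define FLAG_MEET 0x1                    /* hypothetical flag bit */

typedef struct {
    int flags;
    mstime_t ping_sent;   /* 0 when no PING is in flight */
} NodeLite;

/* Returns the type of the first message to send on a new link. */
int on_link_established(NodeLite *n, mstime_t now) {
    mstime_t old_ping_sent = n->ping_sent;
    int type = (n->flags & FLAG_MEET) ? TYPE_MEET : TYPE_PING;
    /* clusterSendPing only stamps ping_sent for PING messages */
    if (type == TYPE_PING) n->ping_sent = now;
    /* assumption: an in-flight PING keeps its original timestamp, so a
     * reconnect does not reset failure detection */
    if (old_ping_sent) n->ping_sent = old_ping_sent;
    /* MEET is sent at most once; afterwards only PINGs are sent */
    n->flags &= ~FLAG_MEET;
    return type;
}
```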

Preventing false timeouts and stale state

clusterCron also guards against false timeouts and stale node information, as shown below. It iterates over all nodes. If a PING to a node has been waiting for its PONG for more than half of cluster-node-timeout, the problem may be the connection rather than the node, so it frees the link; clusterCron reconnects and PINGs the node again on a later run. And if no PING is in flight to a node but its last PONG is older than half of the timeout, it sends that node a PING right away.

/* part of clusterCron() in cluster.c: iterate over the nodes to detect failures */
while ((de = dictNext(di)) != NULL) {
    clusterNode *node = dictGetVal(de);
    now = mstime(); /* Use an updated time at every iteration. */
    mstime_t delay;

    /* if we have been waiting for the PONG for more than half of node timeout,
       reconnect: the node may be fine while the connection is broken */
    if (node->link && /* is connected */
        now - node->link->ctime >
        server.cluster_node_timeout && /* not reconnected recently */
        node->ping_sent && /* a PING has been sent */
        node->pong_received < node->ping_sent && /* still waiting for the PONG */
        /* and the wait has exceeded timeout/2 */
        now - node->ping_sent > server.cluster_node_timeout/2)
    {
        /* free the link; clusterCron() will reconnect automatically next time */
        freeClusterLink(node->link);
    }

    /* if no PING is currently in flight to this node and its last PONG is
       older than half of node timeout, send a PING now, so that no node's
       information gets too old just because random selection missed it */
    if (node->link &&
        node->ping_sent == 0 &&
        (now - node->pong_received) > server.cluster_node_timeout/2)
    {
        clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
        continue;
    }

    /* ... handle failover and mark nodes as suspected offline ... */
}
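The two checks can be restated as a pure decision function, which makes their order explicit: once the link is freed, the second check cannot fire in the same pass because the node no longer has a link. NodeTiming, Action and liveness_action are hypothetical names; all times are in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

typedef enum { ACTION_NONE, ACTION_FREE_LINK, ACTION_SEND_PING } Action;

typedef struct {
    bool has_link;            /* node->link != NULL */
    mstime_t link_ctime;      /* when the current link was created */
    mstime_t ping_sent;       /* 0 when no PING is in flight */
    mstime_t pong_received;
} NodeTiming;

Action liveness_action(const NodeTiming *n, mstime_t now, mstime_t timeout) {
    /* waiting for a PONG longer than timeout/2 on a link older than timeout */
    if (n->has_link &&
        now - n->link_ctime > timeout &&
        n->ping_sent != 0 &&
        n->pong_received < n->ping_sent &&
        now - n->ping_sent > timeout / 2)
        return ACTION_FREE_LINK;   /* link gone: the second check can't apply */

    /* idle node whose last PONG is older than timeout/2 */
    if (n->has_link && n->ping_sent == 0 && now - n->pong_received > timeout / 2)
        return ACTION_SEND_PING;

    return ACTION_NONE;
}
```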

Handling manual failover and marking nodes as suspected offline

If, after the false-timeout handling above, the node still has not received a PONG from the target and more than cluster_node_timeout has passed since the PING was sent, the target is marked as suspected offline (PFAIL).

/* if we are a master and a slave has requested a manual failover, keep PINGing that slave */
if (server.cluster->mf_end &&
    nodeIsMaster(myself) &&
    server.cluster->mf_slave == node &&
    node->link)
{
    clusterSendPing(node->link, CLUSTERMSG_TYPE_PING);
    continue;
}

/* the rest only runs if a PING to this node is in flight */
if (node->ping_sent == 0) continue;

/* how long we have been waiting for the PONG */
delay = now - node->ping_sent;

/* waited longer than the timeout: mark the target node as PFAIL (suspected offline) */
if (delay > server.cluster_node_timeout) {
    if (!(node->flags & (CLUSTER_NODE_PFAIL|CLUSTER_NODE_FAIL))) {
        serverLog(LL_DEBUG,"*** NODE %.40s possibly failing",
            node->name);
        // set the suspected-offline flag
        node->flags |= CLUSTER_NODE_PFAIL;
        update_state = 1;
    }
}
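The PFAIL decision itself is small, and restating it on its own shows that it fires only once per node: a node already flagged PFAIL or FAIL is left alone. NODE_PFAIL, NODE_FAIL and check_pfail are hypothetical stand-ins for Redis's CLUSTER_NODE_* flags and the inline code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t mstime_t;

#define NODE_PFAIL 0x1   /* hypothetical flag values */
#define NODE_FAIL  0x2

/* Returns true if the node was newly flagged PFAIL (the caller would then
 * set update_state, as clusterCron does). */
bool check_pfail(int *flags, mstime_t ping_sent, mstime_t now, mstime_t node_timeout) {
    if (ping_sent == 0) return false;                     /* no PING in flight */
    if (now - ping_sent <= node_timeout) return false;    /* still within timeout */
    if (*flags & (NODE_PFAIL | NODE_FAIL)) return false;  /* already flagged */
    *flags |= NODE_PFAIL;
    return true;
}
```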

Sending Gossip messages

The following is the source code of the clusterSendPing () method that has been called many times before. There are detailed comments in the code, which you can read by yourself. The main operation is to convert the clusterState maintained by the node itself into the corresponding message structure.

/* send a MEET, PING or PONG message to the given node */
void clusterSendPing(clusterLink *link, int type) {
    unsigned char *buf;
    clusterMsg *hdr;
    int gossipcount = 0; /* Number of gossip sections added so far. */
    int wanted; /* Number of gossip sections we want to append if possible. */
    int totlen; /* Total packet length. */

    // freshnodes is the budget of nodes we can still try to gossip about:
    // it is decremented each time a node is added to the gossip section
    // (or skipped as unusable), and the loop stops when it reaches 0.
    // It starts as the number of nodes in the nodes table minus 2:
    // one is myself (the node sending the message),
    // the other is the node receiving the gossip.
    int freshnodes = dictSize(server.cluster->nodes)-2;

    /* number of nodes to gossip about: 1/10 of the cluster nodes,
       at least 3, and never more than freshnodes */
    wanted = floor(dictSize(server.cluster->nodes)/10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;

    /* ... computation of totlen, buffer allocation, etc. omitted ... */

    /* if this is a PING, record when the last PING was sent */
    if (link->node && type == CLUSTERMSG_TYPE_PING)
        link->node->ping_sent = mstime();

    /* write the current node's information (name, address, port, slots) into the header */
    clusterBuildMessageHdr(hdr,type);

    /* Populate the gossip fields */
    int maxiterations = wanted*3;
    /* keep picking random nodes until we have `wanted` entries, run out of
       freshnodes, or exhaust maxiterations */
    while (freshnodes > 0 && gossipcount < wanted && maxiterations--) {
        /* pick a random node from the nodes dictionary */
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);

        /* the following nodes are never selected:
         * - myself: the sending node itself
         * - nodes in PFAIL state (they are added at the end)
         * - nodes in HANDSHAKE state
         * - nodes with the NOADDR flag
         * - disconnected nodes that handle no slots
         */
        if (this == myself) continue;
        if (this->flags & CLUSTER_NODE_PFAIL) continue;
        if (this->flags & (CLUSTER_NODE_HANDSHAKE|CLUSTER_NODE_NOADDR) ||
            (this->link == NULL && this->numslots == 0))
        {
            freshnodes--; /* Technically not correct, but saves CPU. */
            continue;
        }

        // skip nodes already in hdr->data.ping.gossip,
        // i.e. nodes picked earlier (avoids duplicates)
        if (clusterNodeIsInGossipSection(hdr,gossipcount,this)) continue;

        /* the node is valid: add it and update the counters */
        clusterSetGossipEntry(hdr,gossipcount,this);
        freshnodes--;
        gossipcount++;
    }

    /* ... PFAIL nodes, if any, are appended at the end (omitted) ... */

    /* compute the message length */
    totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
    totlen += (sizeof(clusterMsgDataGossip)*gossipcount);
    /* record the number of gossip entries in the count field */
    hdr->count = htons(gossipcount);
    /* record the message length */
    hdr->totlen = htonl(totlen);

    /* send it over the network */
    clusterSendMessage(link,buf,totlen);
    zfree(buf);
}

void clusterSetGossipEntry(clusterMsg *hdr, int i, clusterNode *n) {
    clusterMsgDataGossip *gossip;

    /* Point to the i-th entry of the gossip section. */
    gossip = &(hdr->data.ping.gossip[i]);

    /* Record the name (ID) of the selected node. */
    memcpy(gossip->nodename,n->name,CLUSTER_NAMELEN);

    /* Record when we last sent a PING to the selected node (ms converted to seconds). */
    gossip->ping_sent = htonl(n->ping_sent/1000);

    /* Record when we last received a PONG from the selected node (seconds). */
    gossip->pong_received = htonl(n->pong_received/1000);

    /* Record the IP of the selected node. */
    memcpy(gossip->ip,n->ip,sizeof(n->ip));

    /* Record the client port and the cluster bus port of the selected node. */
    gossip->port = htons(n->port);
    gossip->cport = htons(n->cport);

    /* Record the flags of the selected node. */
    gossip->flags = htons(n->flags);
    gossip->notused1 = 0;
}
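The entry written by clusterSetGossipEntry follows two conventions: timestamps are stored as 32-bit seconds (the millisecond values are divided by 1000), and multi-byte fields are converted to network byte order with htonl/htons. The sketch below round-trips a simplified entry to show both effects. `toy_gossip_wire`, `toy_node_info`, `TOY_NAMELEN` and the two helpers are hypothetical; the struct omits the IP, cport and padding fields of the real `clusterMsgDataGossip`.

```c
#include <arpa/inet.h>   /* htonl, htons, ntohl, ntohs (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NAMELEN 40   /* same length as Redis's CLUSTER_NAMELEN */

/* Simplified wire entry: all integers are in network byte order. */
typedef struct toy_gossip_wire {
    char nodename[TOY_NAMELEN];
    uint32_t ping_sent;      /* seconds */
    uint32_t pong_received;  /* seconds */
    uint16_t port;
    uint16_t flags;
} toy_gossip_wire;

/* Simplified in-memory node view: host byte order, millisecond timestamps. */
typedef struct toy_node_info {
    char name[TOY_NAMELEN];
    int64_t ping_sent_ms;
    int64_t pong_received_ms;
    int port;
    int flags;
} toy_node_info;

/* Encode: truncate ms to s, then convert to network byte order. */
static void toy_set_gossip_entry(toy_gossip_wire *g, const toy_node_info *n) {
    assert(g != NULL && n != NULL);
    memcpy(g->nodename, n->name, TOY_NAMELEN);
    g->ping_sent = htonl((uint32_t)(n->ping_sent_ms / 1000));
    g->pong_received = htonl((uint32_t)(n->pong_received_ms / 1000));
    g->port = htons((uint16_t)n->port);
    g->flags = htons((uint16_t)n->flags);
}

/* Decode: convert back to host byte order; seconds become whole ms again. */
static void toy_get_gossip_entry(const toy_gossip_wire *g, toy_node_info *out) {
    memcpy(out->name, g->nodename, TOY_NAMELEN);
    out->ping_sent_ms = (int64_t)ntohl(g->ping_sent) * 1000;
    out->pong_received_ms = (int64_t)ntohl(g->pong_received) * 1000;
    out->port = ntohs(g->port);
    out->flags = ntohs(g->flags);
}
```

Encoding a node whose `ping_sent_ms` is 1700000000123 and decoding it again yields 1700000000000: the sub-second part does not survive the trip, while ports and flags round-trip unchanged.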

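The following is the clusterBuildMessageHdr function. It fills in the common header of every cluster bus message: the protocol version, signature and message type, followed by the state of the sending node (node ID, IP and ports, slot bitmap, master ID, flags, cluster state, epochs and replication offset).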

/* Build the header of a cluster bus message. */
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    int totlen = 0;
    uint64_t offset;
    clusterNode *master;

    /* If the current node is a slave, master is its master node;
     * if the current node is a master, master is the node itself. */
    master = (nodeIsSlave(myself) && myself->slaveof) ?
              myself->slaveof : myself;

    memset(hdr,0,sizeof(*hdr));

    /* Initialize the protocol version, signature, and message type. */
    hdr->ver = htons(CLUSTER_PROTO_VER);
    hdr->sig[0] = 'R';
    hdr->sig[1] = 'C';
    hdr->sig[2] = 'm';
    hdr->sig[3] = 'b';
    hdr->type = htons(type);

    /* Set the ID of the current node in the header. */
    memcpy(hdr->sender,myself->name,CLUSTER_NAMELEN);

    /* Set the IP of the current node. If cluster-announce-ip is not
     * configured, the field stays zeroed and receivers detect the address. */
    memset(hdr->myip,0,NET_IP_STR_LEN);
    if (server.cluster_announce_ip) {
        strncpy(hdr->myip,server.cluster_announce_ip,NET_IP_STR_LEN);
        hdr->myip[NET_IP_STR_LEN-1] = '\0';
    }

    /* Client port and cluster bus port of the node. */
    int announced_port = server.cluster_announce_port ?
                         server.cluster_announce_port : server.port;
    int announced_cport = server.cluster_announce_bus_port ?
                          server.cluster_announce_bus_port :
                          (server.port + CLUSTER_PORT_INCR);

    /* Set the slot bitmap (the master's slots when this node is a slave). */
    memcpy(hdr->myslots,master->slots,sizeof(hdr->myslots));
    memset(hdr->slaveof,0,CLUSTER_NAMELEN);
    if (myself->slaveof != NULL)
        memcpy(hdr->slaveof,myself->slaveof->name, CLUSTER_NAMELEN);
    hdr->port = htons(announced_port);
    hdr->cport = htons(announced_cport);
    hdr->flags = htons(myself->flags);
    hdr->state = server.cluster->state;

    /* Set currentEpoch and configEpoch. */
    hdr->currentEpoch = htonu64(server.cluster->currentEpoch);
    hdr->configEpoch = htonu64(master->configEpoch);

    /* Set the replication offset. */
    if (nodeIsSlave(myself))
        offset = replicationGetSlaveOffset();
    else
        offset = server.master_repl_offset;
    hdr->offset = htonu64(offset);

    /* Set the message flags. */
    if (nodeIsMaster(myself) && server.cluster->mf_end)
        hdr->mflags[0] |= CLUSTERMSG_FLAG0_PAUSED;

    /* Calculate and set the total length of the message. Only FAIL and
     * UPDATE are handled here; for PING, PONG and MEET the caller sets it. */
    if (type == CLUSTERMSG_TYPE_FAIL) {
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataFail);
    } else if (type == CLUSTERMSG_TYPE_UPDATE) {
        totlen = sizeof(clusterMsg)-sizeof(union clusterMsgData);
        totlen += sizeof(clusterMsgDataUpdate);
    }
    hdr->totlen = htonl(totlen);
}
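Two details of clusterBuildMessageHdr are easy to check in isolation. First, the announced ports fall back to `server.port` and `server.port + CLUSTER_PORT_INCR` (10000 in Redis) when no announce option is set. Second, the message length is the fixed header size, computed as `sizeof(clusterMsg) - sizeof(union clusterMsgData)`, plus the size of the payload. The sketch below reproduces both with hypothetical `toy_*` types and message-type constants. Unlike the real function, which leaves the PING/PONG/MEET length to the caller, `toy_totlen` folds the gossip-count formula from clusterSendPing into the same helper; `toy_has_cluster_sig` is only an illustrative receiver-side check of the `RCmb` signature.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PORT_INCR 10000  /* bus port = data port + 10000, like CLUSTER_PORT_INCR */

/* A configured announce value wins; 0 means "not configured". */
static int toy_announced_port(int announce_port, int port) {
    return announce_port ? announce_port : port;
}

static int toy_announced_cport(int announce_bus_port, int port) {
    return announce_bus_port ? announce_bus_port : port + TOY_PORT_INCR;
}

/* Illustrative check that the first 4 bytes carry the "RCmb" signature. */
static int toy_has_cluster_sig(const char sig[4]) {
    return memcmp(sig, "RCmb", 4) == 0;
}

/* Hypothetical payload types, shaped like the real clusterMsgData members. */
typedef struct { char nodename[40]; unsigned short port; } toy_gossip;
typedef struct { char nodename[40]; } toy_fail;
typedef struct { unsigned long long config_epoch; char nodename[40]; } toy_update;

union toy_msg_data {
    struct { toy_gossip gossip[1]; } ping;  /* really a variable-length array */
    struct { toy_fail about; } fail;
    struct { toy_update nodecfg; } update;
};

typedef struct {
    char sig[4];
    unsigned short type;
    unsigned int totlen;
    union toy_msg_data data;  /* payload union must be the last member */
} toy_msg;

enum { TOY_TYPE_PING, TOY_TYPE_FAIL, TOY_TYPE_UPDATE };

/* Fixed header size: everything in toy_msg except the payload union. */
static size_t toy_header_len(void) {
    return sizeof(toy_msg) - sizeof(union toy_msg_data);
}

/* Total length: header plus the payload that the message type carries. */
static size_t toy_totlen(int type, int gossipcount) {
    switch (type) {
    case TOY_TYPE_FAIL:   return toy_header_len() + sizeof(toy_fail);
    case TOY_TYPE_UPDATE: return toy_header_len() + sizeof(toy_update);
    default:              return toy_header_len() + (size_t)gossipcount * sizeof(toy_gossip);
    }
}
```

Subtracting the union size works because the payload union is the last member of the message struct, so for this layout the difference equals `offsetof(toy_msg, data)`; a PING carrying three gossip entries is then the header plus three `toy_gossip` records.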

