This article explains what the Raft distributed consensus algorithm is and how it works.
Raft is a distributed consensus algorithm based on log replication. It aims to be easier to understand and to implement in practice while providing the same fault tolerance and performance as the Multi-Paxos consensus algorithm. Paxos offers a solution to the consensus problem faced by distributed systems, but it has long been criticized as hard to understand, let alone to implement correctly, and that difficulty is the main reason Raft was created.
The authors of Raft consider understandability and engineering implementability to be essential qualities of a good distributed consensus algorithm, and they ensure understandability through two techniques:
Problem decomposition: the consensus problem is divided into four relatively independent sub-problems: leader election, log replication, safety, and membership changes.
State-space reduction: the number of system states the algorithm has to consider is reduced by strengthening consistency at certain stages (for example, by constraining which nodes are eligible to become candidates for the next Leader).
To understand how the Raft algorithm works, it helps to first have a feel for its application scenario. Take a simple distributed KV system as an example: it consists of multiple participating nodes, each holding a complete copy of the data. The system responds to users' read and write requests and guarantees linearizable semantics for them, i.e., once a user has successfully written a key-value pair, a read request issued to any node afterwards will see the data written at or before that point. A typical distributed KV system restricts data update requests to the Leader node while allowing all nodes to serve reads, because most workloads read far more than they write.
How to ensure that every node holds a complete copy of the data is the central problem such a KV system faces, and it is a typical problem for distributed systems in general. The usual solution is to introduce a log, in which each entry corresponds to one update operation (also called an instruction) issued by a user. Log entries are generated on the Leader node and replicated to all participating nodes in the cluster. If we can guarantee that every node's replicated log is complete and in the same order as the Leader's log, then replaying the instructions carried by those entries, in order, on each node will eventually bring the data copies on all nodes into agreement. This is the basic idea of the replicated state machine (Replicated State Machines).
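As a concrete illustration of this idea, here is a minimal Go sketch of a replicated state machine for the toy KV example; the type and field names (LogEntry, KVStateMachine, Op) are illustrative and not tied to any particular Raft implementation. The point is simply that replaying the same log in the same order yields the same state on every node.

```go
package main

import "fmt"

// LogEntry carries one user instruction; every node stores the same
// sequence of entries in the same order.
type LogEntry struct {
	Term  uint64 // term in which the entry was created on the Leader
	Index uint64 // position of the entry in the log
	Op    string // "PUT" or "DEL" for this toy KV example
	Key   string
	Value string
}

// KVStateMachine is the business state machine: replaying the same log
// on every node yields the same key-value state.
type KVStateMachine struct {
	data map[string]string
}

func NewKVStateMachine() *KVStateMachine {
	return &KVStateMachine{data: make(map[string]string)}
}

// Apply replays a single committed entry against the local state.
func (sm *KVStateMachine) Apply(e LogEntry) {
	switch e.Op {
	case "PUT":
		sm.data[e.Key] = e.Value
	case "DEL":
		delete(sm.data, e.Key)
	}
}

func main() {
	log := []LogEntry{
		{Term: 1, Index: 1, Op: "PUT", Key: "x", Value: "1"},
		{Term: 1, Index: 2, Op: "PUT", Key: "y", Value: "2"},
		{Term: 2, Index: 3, Op: "DEL", Key: "x"},
	}
	sm := NewKVStateMachine()
	for _, e := range log { // replaying in log order keeps replicas consistent
		sm.Apply(e)
	}
	fmt.Println(sm.data) // map[y:2]
}
```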
This distributed KV system therefore faces two main problems:
How to correctly elect a Leader node?
How to achieve orderly and consistent replication of log data from Leader to other nodes?
These are also the two fundamental problems the Raft algorithm must solve. Of course, solving them raises a number of safety issues, which we discuss together in a dedicated section.
Before formally introducing how the Raft algorithm runs, let us briefly cover some of its basic attributes and key features.
First the basic attributes, which mainly consist of the concepts term and logIndex:
Term: usually translated as "term of office"; it can be pictured more vividly as a "dynasty", with the Leader node as the king of that dynasty (unlike real dynasties, which may have several kings, a term allows at most one Leader). The term value increases each time the "dynasty" changes (it starts at 0 and increases monotonically).
LogIndex: the Raft algorithm records the user's operation instructions in a log and maintains their order by assigning an index to each entry. Raft requires logIndex to be monotonically increasing and unique within a term; some implementations further require logIndex to be monotonically increasing and unique across the whole lifetime of the algorithm.
In a distributed system it is usually necessary to introduce a notion of a clock, and to avoid physical clock skew affecting correctness, distributed systems often rely on logical clocks. Together, term and logIndex act as a logical clock for the Raft algorithm that can order log entries, as explained further below. Viewing them as a clock makes it natural to see why Raft requires term and logIndex to be monotonically increasing and never repeated or changed.
Regarding logIndex, there are two special positions to note:
CommittedLogIndex: a log entry is marked committed only after it has been successfully replicated to more than half of the nodes in the cluster; lastCommittedLogIndex is always less than or equal to lastLogIndex.
AppliedLogIndex: a log entry is marked applied only after the instruction it carries has been successfully applied to the business state machine; lastAppliedLogIndex is always less than or equal to lastCommittedLogIndex, i.e., an entry can be applied only if it has already been committed.
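A small Go sketch of this index bookkeeping, reusing the LogEntry and KVStateMachine types from the sketch above; the field names commitIndex and lastApplied follow the Raft paper, and the helper itself is illustrative:

```go
// Per-node log bookkeeping; entries[i] is assumed to hold Index == uint64(i)+1.
type RaftLog struct {
	entries     []LogEntry
	commitIndex uint64 // highest logIndex known to be committed (replicated on a majority)
	lastApplied uint64 // highest logIndex whose instruction has been applied locally
}

// applyCommitted pushes newly committed entries into the state machine,
// preserving the invariant lastApplied <= commitIndex <= lastLogIndex.
func (l *RaftLog) applyCommitted(sm *KVStateMachine) {
	for l.lastApplied < l.commitIndex {
		l.lastApplied++
		sm.Apply(l.entries[l.lastApplied-1])
	}
}
```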
Let's take a look at the key features of the Raft algorithm:
Strong Leader: within a single term the cluster contains only one Leader node, and log data flows in only one direction, from the Leader to the other Follower nodes. This design simplifies the management of log replicas and makes the algorithm easier to understand.
Leader Election: the election process is triggered by a randomized timer. Compared with a fixed timeout, this small change resolves split votes more simply and efficiently.
Membership Changes: cluster membership changes are handled with a joint consensus mechanism, which allows two different node configurations to coexist during the transition so that the cluster keeps operating normally while nodes are added or removed.
In this article we focus on leader election, log replication, and some safety points that must be considered along the way. When introducing leader election and log replication we concentrate on how the algorithm runs; the safety issues involved are deferred to the safety section for a unified discussion.
Leader election
Nodes in the Raft algorithm play one of three roles: Leader, Candidate, and Follower. Within a term, the cluster contains at most one Leader node and the remaining nodes are Followers; Candidate is the intermediate state of a Follower that is trying to run for the next Leader.
The Raft algorithm triggers the leader election process via a heartbeat mechanism. While in power, a Leader node keeps sending heartbeat requests to all Follower nodes to assert its authority. If a Follower does not receive a heartbeat from the Leader for a period of time, it assumes that "the emperor has died", switches to the Candidate state, and tries to "usurp the throne" by running for the next Leader.
Raft elects its leader by voting, so a node's campaign is essentially the process of canvassing votes from the other nodes in the cluster. Nodes in a Raft cluster generally communicate via RPC, so a candidate sends RequestVote RPC requests to all other nodes to solicit votes (and casts one vote for itself). The request contains the following fields:
Term: the candidate's term value (its current term plus 1), i.e., the term value it will use if it is elected Leader.
CandidateId: the node ID of the candidate.
LastLogIndex: the logIndex of the candidate's last local log entry.
LastLogTerm: the term of the candidate's last local log entry.
To ensure that at most one Leader is elected in a term, Raft requires that a node vote for only one candidate per term, on a first-come-first-served basis. If the node receiving the request has not yet voted for any candidate in this term, it checks whether the request satisfies the following conditions:
The candidate's term value is greater than or equal to the current node's term value.
The candidate's log is at least as up-to-date as the current node's: its lastLogTerm is greater than the current node's, or the two lastLogTerm values are equal and its lastLogIndex is greater than or equal to the current node's.
Only when both conditions hold will the current node cast its precious vote for the candidate and promise not to vote for any other candidate during the current term.
The node's RequestVote response to the candidate carries the following information:
Term: the current term value of the voting node.
VoteGranted: whether the vote is granted.
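The request, the response, and the vote-granting rule above can be sketched in Go roughly as follows; the field names follow the Raft paper, while decideVote is an illustrative helper rather than part of any specific implementation:

```go
// RequestVote RPC messages, with the fields described above.
type RequestVoteArgs struct {
	Term         uint64 // candidate's proposed term (current term + 1)
	CandidateId  string
	LastLogIndex uint64 // index of candidate's last log entry
	LastLogTerm  uint64 // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        uint64 // voter's current term, so the candidate can update itself
	VoteGranted bool
}

// decideVote sketches the voter-side rule: one vote per term, granted only
// if the candidate's term is current and its log is at least as up-to-date.
func decideVote(args RequestVoteArgs,
	currentTerm uint64, votedFor string,
	lastLogIndex, lastLogTerm uint64) RequestVoteReply {

	if args.Term < currentTerm {
		return RequestVoteReply{Term: currentTerm, VoteGranted: false}
	}
	alreadyVoted := votedFor != "" && votedFor != args.CandidateId
	logUpToDate := args.LastLogTerm > lastLogTerm ||
		(args.LastLogTerm == lastLogTerm && args.LastLogIndex >= lastLogIndex)
	return RequestVoteReply{
		Term:        currentTerm,
		VoteGranted: !alreadyVoted && logUpToDate,
	}
}
```

Note that the up-to-date check compares lastLogTerm before lastLogIndex; this detail matters for the safety discussion later.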
A candidate may encounter three situations while canvassing votes in a round:
The candidate node wins more than half of the votes in the cluster.
A RequestVote response carries a term value higher than the candidate's own, indicating that another node has already won the election for a newer term.
No candidate wins a majority, i.e., no node wins this round of the election.
In the first case, the candidate has successfully become Leader. To keep other nodes from "usurping the throne", it must immediately send heartbeat requests to all nodes in the cluster to assert its authority.
In the second case, if the purported Leader's term is at least as large as the current node's, the current node must switch back to the Follower role and submit to the new Leader; if it is smaller, the node can ignore that "Leader" and continue its own campaign.
In the third case, at least two nodes started the election at roughly the same time and split the votes, so no node obtained a majority. These nodes can only wait for a while and then start a new round of elections to break the stalemate.
To make the third case as unlikely as possible, a simple and effective solution is to randomize the timeout after which each node starts an election. As long as the nodes rarely launch their campaigns at exactly the same moment, the probability of a split vote drops sharply.
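A minimal sketch of such a randomized timeout in Go; the 150-300 ms range is only an example (the Raft paper suggests values in this ballpark), and real systems tune it to their network latency:

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout spreads election timers over a range so that nodes
// rarely time out at the same instant and split the vote.
func randomElectionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	jitter := time.Duration(rand.Intn(150)) * time.Millisecond // 0-149 ms of jitter
	return base + jitter
}
```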
There is one more issue to consider for leader election: some nodes may fail to receive the Leader's authority-declaring heartbeats in time (due to network delay, high node load, and so on). Such nodes mistakenly conclude that "the emperor is dead" and try to "usurp the throne". Following Raft's election design, these "disruptive" nodes increment their term value and start an election, but because most nodes still consider their lease with the Leader valid, the usurpation attempts usually end in failure. These unnecessary elections nevertheless waste term values and threaten the stability of the cluster.
To solve this problem, the election can be split into two stages: a pre-vote and a formal election. A node must pass a round of pre-voting before it may start a formal election. The difference is that during the pre-vote the candidate does not actually increment its term; it tentatively canvasses votes using term + 1. Only if the pre-vote succeeds does it really increment its term and enter the formal election to canvass votes; otherwise the running state of the cluster is not disturbed.
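A rough Go sketch of the pre-vote idea described above; the prevoteNode type and its askVotes callback are hypothetical placeholders standing in for real RPC plumbing:

```go
// Minimal illustration of the pre-vote idea: the node canvasses votes using
// term+1 without persisting the bump, and only increments its real term if a
// majority would grant the vote.
type prevoteNode struct {
	currentTerm  uint64
	id           string
	lastLogIndex uint64
	lastLogTerm  uint64
	// askVotes abstracts "send RequestVote to every peer and count grants";
	// the bool argument marks the request as a pre-vote.
	askVotes func(term, lastLogIndex, lastLogTerm uint64, preVote bool) (granted int, clusterSize int)
}

func (n *prevoteNode) maybeStartElection() bool {
	// Pre-vote: tentative term+1, currentTerm is NOT changed yet.
	granted, size := n.askVotes(n.currentTerm+1, n.lastLogIndex, n.lastLogTerm, true)
	if granted*2 <= size {
		return false // could not win; do not waste a term or disturb the cluster
	}
	// Formal election: only now bump the real term and canvass again.
	n.currentTerm++
	granted, size = n.askVotes(n.currentTerm, n.lastLogIndex, n.lastLogTerm, false)
	return granted*2 > size
}
```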
Log replication
Once a Leader has been elected correctly, the next core problem Raft must solve is how to replicate log data, in order, from the Leader node to all Follower nodes in the cluster while keeping the log data on every node consistent.
Logs in the Raft algorithm fall into two categories: logs generated internally while the system runs, such as cluster membership-change information, and logs generated when users submit requests to the Raft cluster, which carry the users' operation instructions and make up the bulk of Raft logs. When replicating logs, however, Raft does not distinguish between the two types; it handles them uniformly.
The Leader node replicates log data to each Follower node concurrently. Once a log entry has been successfully replicated to more than half of the nodes in the cluster, it is regarded as committed. For committed entries, the replicated state machine on each node parses and applies the instructions they carry and then marks them as applied.
In addition to carrying an operation instruction, a log entry in the Raft algorithm contains at least two basic attributes, term and logIndex, and the algorithm makes two promises about them:
If two logs have the same term and logIndex, then the instructions carried by the two logs must be the same.
If two log entries have the same term and logIndex, then all of the entries preceding them are also identical, i.e., they agree on term, logIndex, and instruction.
Promise 1 holds because the Raft algorithm requires logIndex within a term to be monotonically increasing and never repeated or changed.
For promise 2, a log replication request carries the term and logIndex of the entry immediately preceding the new entries (prevLogTerm and prevLogIndex), and the request can succeed only if prevLogTerm and prevLogIndex match the Follower's log. If they do not match, the current node may be new, may previously have followed another Leader, or may itself have been a Leader. To fulfil promise 2, the Leader works backwards with the Follower to find the latest entry on which their term and logIndex agree, and then forcibly overwrites the Follower's log from that point onwards with its own entries.
Generally, the Leader replicates log data to a target Follower via an RPC request. The corresponding AppendEntries RPC request for log replication contains the following fields:
Term: the term value of the Leader node.
LeaderId: the ID of the Leader node, so a Follower can redirect client requests to the Leader accordingly.
PrevLogIndex: the logIndex value of the preceding log.
PrevLogTerm: the term value of the preceding log.
Entries: the log entries replicated in this request; if the collection is empty, the request serves as a heartbeat.
LeaderCommit: the lastCommittedLogIndex value of the Leader node.
The Follower node's AppendEntries response contains the following:
Term: the term value of the Follower node. If it is larger than the Leader's, the Leader updates its own term value and steps down.
Success: identifies whether the Follower node successfully replicated the log data in the request.
After receiving an AppendEntries request, the Follower node proceeds as follows:
If the term value in the request is less than its own, the source Leader of the request has expired; reject the request and return its own term value.
If no local log entry matches prevLogTerm and prevLogIndex, reject the request, in keeping with promise 2.
If an existing local entry conflicts with a new entry in the request (same logIndex but a different term), delete the existing entry and all entries after it, in keeping with promise 2. With a globally increasing logIndex design this situation does not arise.
Once prevLogTerm and prevLogIndex have been matched, append any log entries in the request that are not already present locally.
If the leaderCommit in the request is larger than the locally recorded committedLogIndex, update the local committedLogIndex to min(leaderCommit, lastLogIndex).
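Putting the request, the response, and the five steps together, here is a hedged Go sketch of a Follower-side AppendEntries handler; the message field names follow the Raft paper, while the Entry and follower types are illustrative:

```go
// AppendEntries RPC messages and a follower-side handler following the steps above.
type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

type AppendEntriesArgs struct {
	Term         uint64  // Leader's term
	LeaderId     string  // so Followers can redirect clients
	PrevLogIndex uint64  // index of the entry immediately preceding Entries
	PrevLogTerm  uint64  // term of that preceding entry
	Entries      []Entry // empty for a pure heartbeat
	LeaderCommit uint64  // Leader's lastCommittedLogIndex
}

type AppendEntriesReply struct {
	Term    uint64
	Success bool
}

type follower struct {
	currentTerm uint64
	log         []Entry // log[i].Index == uint64(i)+1 in this sketch
	commitIndex uint64
}

func (f *follower) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	// Step 1: reject requests from a stale Leader.
	if args.Term < f.currentTerm {
		return AppendEntriesReply{Term: f.currentTerm, Success: false}
	}
	// Step 2: reject if the preceding log entry does not match (Log Matching promise).
	if args.PrevLogIndex > 0 {
		if uint64(len(f.log)) < args.PrevLogIndex ||
			f.log[args.PrevLogIndex-1].Term != args.PrevLogTerm {
			return AppendEntriesReply{Term: f.currentTerm, Success: false}
		}
	}
	// Steps 3 and 4: delete any conflicting suffix, then append the new entries.
	for _, e := range args.Entries {
		if uint64(len(f.log)) >= e.Index && f.log[e.Index-1].Term != e.Term {
			f.log = f.log[:e.Index-1] // drop the conflicting entry and everything after it
		}
		if uint64(len(f.log)) < e.Index {
			f.log = append(f.log, e)
		}
	}
	// Step 5: advance the local commit index, bounded by our own last log index.
	if args.LeaderCommit > f.commitIndex {
		last := uint64(len(f.log))
		if args.LeaderCommit < last {
			f.commitIndex = args.LeaderCommit
		} else {
			f.commitIndex = last
		}
	}
	return AppendEntriesReply{Term: f.currentTerm, Success: true}
}
```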
At the implementation level, the Leader node can start one thread per Follower, dedicated to replicating log data to that Follower and maintaining its heartbeat. Because each Follower's load and network connectivity differ, this design keeps the replication to each Follower independent of the others: if the host or network of one Follower fails, the corresponding thread keeps retrying the AppendEntries request until it succeeds, without affecting replication to the remaining Followers.
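A sketch of that per-Follower worker design in Go, reusing the AppendEntriesArgs type from the previous sketch; the replicator type and its function fields (nextBatch, sendAppendEntries, onSuccess) are hypothetical placeholders for the Leader's real bookkeeping and RPC layer:

```go
import (
	"sync/atomic"
	"time"
)

// replicator runs one goroutine per Follower, so a slow or unreachable peer
// never blocks replication to the others.
type replicator struct {
	stopped       atomic.Bool
	retryInterval time.Duration
	// nextBatch returns the entries this peer still misses (empty slice => heartbeat).
	nextBatch func(peer string) AppendEntriesArgs
	// sendAppendEntries performs the RPC and reports whether it succeeded.
	sendAppendEntries func(peer string, args AppendEntriesArgs) bool
	// onSuccess lets the Leader advance its per-peer bookkeeping.
	onSuccess func(peer string, args AppendEntriesArgs)
}

func (r *replicator) run(peers []string) {
	for _, peer := range peers {
		go func(peer string) {
			for !r.stopped.Load() {
				args := r.nextBatch(peer)
				if !r.sendAppendEntries(peer, args) {
					time.Sleep(r.retryInterval) // retry this peer only; others proceed
					continue
				}
				r.onSuccess(peer, args)
			}
		}(peer)
	}
}
```

Each goroutine owns exactly one peer, so a retry loop for a failed peer never delays the rest of the cluster.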
Safety
Above we introduced Raft's design for leader election and log replication, which overall is fairly easy to understand. So far, however, we have assumed an ideal operating environment; the situations faced in real deployments are much more complicated, and as a distributed consensus algorithm Raft must provide fault tolerance in the face of such complexity. In this section we discuss the safety problems that can arise in actual operation and how the Raft algorithm solves them.
Leader node downtime
From the earlier description of leader election we know that, when the Leader node goes down, the Raft algorithm elects a new Leader from the cluster based on the timeout mechanism. Now imagine that the newly elected Leader has not fully synchronized the log entries the previous Leader had already committed. Under Raft's log replication rules, the new Leader would forcibly overwrite the conflicting log entries on the other nodes; if some nodes had already applied the instructions carried by those overwritten entries to their state machines, the data on different nodes would diverge. Therefore, not every node should be given a chance to win the election.
The Raft algorithm solves this problem by restricting elections so that only nodes containing all committed log entries have a chance to win. This is why Raft adds the following two conditions that a voting node checks before granting its vote:
The candidate's term value is greater than or equal to the voting node's term value.
The candidate's log is at least as up-to-date as the voting node's: its lastLogTerm is greater than the voter's, or the two lastLogTerm values are equal and its lastLogIndex is greater than or equal to the voter's.
Do these two conditions guarantee that the newly elected Leader node contains all committed log entries? We first need to make two points clear:
The condition that a log is committed is that the log is successfully replicated by more than half of the nodes in the cluster.
The condition for a node to succeed in running for Leader is that it wins more than half of the votes of the nodes in the cluster.
Because both conditions require "more than half" of the nodes, the two majorities must overlap in at least one node that both holds the latest committed entry and voted for the candidate. For example, in a five-node cluster a committed entry resides on at least three nodes and a winning candidate needs at least three votes, so at least one voter must hold that entry. Combined with the election constraints above, the elected Leader's log must be at least as up-to-date as that voter's, so the new Leader must contain the latest committed entries.
This analysis shows that during a Leader change the new Leader will not overwrite entries the previous Leader had committed, so no data inconsistency arises.
Leader node downtime affects not only leader election but also log replication. Consider the sequence in which a Raft cluster handles a client request to submit an instruction: the Leader appends the instruction to its log, replicates the log to the Followers, commits it, and finally responds to the client. The Leader may go down at any of these stages, so let us discuss how the Raft algorithm preserves data consistency in each case.
Leader downtime before copying logs to Follower nodes
After the Leader node has generated the log entry for an instruction, it writes the entry to its local storage system and then starts replicating it concurrently to all Followers in the cluster. If the Leader goes down before replication starts, the entry is still uncommitted. The newly elected Leader certainly does not contain this entry, so when the previous Leader recovers and rejoins in the Follower role, the new Leader forcibly overwrites the conflicting entry with its own log. Since the instruction in that entry was never applied to the state machine, no data inconsistency results.
Leader downtime while copying logs to Follower nodes
The Leader node goes down while replicating the log entry to the Follower nodes, at which point the entry is still uncommitted. This stage splits into two cases:
The entry has been replicated to only a small minority of the Follower nodes in the cluster.
The entry has been replicated to more than half of the nodes in the cluster.
In case 1, the new Leader node does not necessarily contain the entry. If it does, the situation reduces to case 2; if it does not, the new Leader forcibly overwrites the conflicting entry with its own log. Since the instruction was never applied to the state machine, no data inconsistency results.
In case 2, the new Leader node necessarily contains the entry, but it cannot know whether the previous Leader had already committed it, so it cannot base its decision on the previous Leader's state and must continue replicating the entry. For entries that are already committed, the new Leader can be sure they have been, or will eventually be, replicated to all nodes in the cluster; for uncommitted entries, the new Leader must try to replicate them to each Follower again and decide whether to commit them based on how that replication goes.
An extreme instance of case 2 is the Leader going down just before it was about to commit the entry.
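One common way to make that commit decision, following the rule in the Raft paper that a Leader only commits entries from its own term by counting replicas (older-term entries then become committed indirectly), is sketched below in Go; the Entry type is reused from the earlier sketch, and matchIndex stands for the Leader's per-peer record of the highest replicated index:

```go
// advanceCommitIndex returns the new commit index for the Leader: it only
// counts replicas for entries from the Leader's OWN term; once such an entry
// reaches a majority, everything before it (including older-term entries)
// becomes committed as well.
func advanceCommitIndex(log []Entry, currentTerm uint64,
	commitIndex uint64, matchIndex map[string]uint64, clusterSize int) uint64 {

	for n := uint64(len(log)); n > commitIndex; n-- {
		if log[n-1].Term != currentTerm {
			break // never count replicas for an entry from a previous term
		}
		replicas := 1 // the Leader itself stores the entry
		for _, m := range matchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas*2 > clusterSize {
			return n // entries up to and including n are now committed
		}
	}
	return commitIndex
}
```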
Leader downtime before responding to the client
If the Leader goes down before responding to the client, the entry is already committed locally on the Leader, so the new Leader is guaranteed to contain it. The new Leader may not yet know that the entry has been committed, but under the log replication mechanism of the Raft algorithm it will eventually mark the entry as committed.
In this case the client, having timed out while waiting, will mistakenly believe the submission of the instruction failed and will retry it, so the client needs to take the idempotence of instructions into account.
Follower or Candidate node downtime
Compared with Leader node downtime, handling the downtime of a Follower or Candidate node is much simpler. Candidate is only an intermediate state during an election; if such a node goes down, it merely loses the election and does not affect the running state of the cluster. A Follower node passively receives requests throughout the operation of the algorithm, and the Raft algorithm requires the Leader to keep retrying when replicating a log entry to a Follower fails, until the replication succeeds. Therefore, when a Follower node recovers from downtime, replication simply resumes from where it stopped and no log entries are lost.
Timing considerations
When designing a distributed system we often need to order events by time, and that ordering should rely on a logical clock rather than a physical one. As mentioned earlier, in the Raft algorithm the combination of term and logIndex serves as such a logical clock. In addition, because the normal operation of the Raft algorithm depends on timeout policies, the following time quantities need to be considered:
BroadcastTime: the average network latency between machines in the cluster.
ElectionTimeout: the election timeout of a Raft node.
MTBF: the mean time between failures of a single machine.
To ensure the normal operation of the Raft algorithm, these quantities must satisfy broadcastTime ≪ electionTimeout ≪ MTBF: heartbeats must arrive well within the election timeout so that Followers do not start unnecessary elections, and the election timeout must be far smaller than the time between failures so that the cluster is rarely left without a Leader.