This article focuses on how to implement the distributed consensus algorithm Raft. The approach introduced here is simple, fast, and practical; interested readers may wish to take a look and follow along.
The CAP principle
The principle of C (consistency), A (availability) and P (partition tolerance) is a topic no distributed system can avoid: in any distributed system, consistency, availability and partition tolerance cannot all be satisfied at the same time.
AP: if the system must be highly available (A) and partition tolerant (P), then strong consistency (C) must be given up.
CP: if strong data consistency (C) is required while tolerating network partitions (P), availability cannot be guaranteed, because a partition can make synchronization take an unbounded amount of time, so availability (A) is given up.
CA: if there is no network partition (a partition here means, for example, separate data centers or regions) (P), then strong consistency (C) and availability (A) can be satisfied at the same time.
A brief introduction to the Raft consensus algorithm
In a Raft cluster, each node plays a role: either Leader (leader node) or Follower (follower node); before a Leader is elected, a node can also be a Candidate (candidate node).
The Raft algorithm stipulates that a Raft cluster can have only one Leader node, and only the Leader node may handle the client's read and write requests. The Leader translates write requests into operation log entries and replicates those entries to the other Follower nodes. Once the Leader node has successfully synchronized an operation log entry to a majority of the nodes (a majority that includes the Leader itself), it can apply the entry to the state machine; the state machine executes the write command, ensuring the eventual consistency of the data.
We can think of MySQL's Binlog as the log of write commands, while the MyISAM storage engine is the state machine that executes the commands in the Binlog.
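To make the log/state-machine split concrete, here is a minimal sketch. The types CommittedEntry, StateMachine, and KvStateMachine are hypothetical illustrations, not classes defined elsewhere in this article.

import java.util.HashMap;
import java.util.Map;

// Hypothetical types for illustration only.
// A committed operation log entry: the write command the Leader has replicated to a majority.
class CommittedEntry {
    long term;
    long index;
    String key;
    String value;
}

// The state machine only ever sees committed entries, applied in log order.
interface StateMachine {
    void apply(CommittedEntry entry);
}

// A toy key-value state machine, playing the role MyISAM plays for Binlog in the analogy above.
class KvStateMachine implements StateMachine {
    private final Map<String, String> store = new HashMap<>();

    @Override
    public void apply(CommittedEntry entry) {
        // Executing the write command changes the state machine's state.
        store.put(entry.key, entry.value);
    }

    public String get(String key) {
        return store.get(key);
    }
}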
Implementing the Raft algorithm requires two RPC interfaces:
RequestVoteRpc: during an election, a candidate node sends vote requests to the other nodes.
AppendEntriesRpc: the Leader node sends log replication requests, heartbeat requests, and log commit requests to the Follower nodes.
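As a rough sketch of how these two RPCs might be exposed (the interface name RaftRpcService and the method names are assumptions; the request/response classes are the DTOs shown further below):

// Illustrative interface only; a real implementation would sit behind an RPC framework.
public interface RaftRpcService {

    // Called by a candidate during an election to ask this node for its vote.
    RequestVoteResp requestVote(RequestVote request);

    // Called by the Leader to replicate log entries, send heartbeats (empty entries),
    // and tell Followers which entries have been committed.
    AppendEntriesResp appendEntries(AppendEntries request);
}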
Heartbeat timer
The Leader node must periodically send heartbeats to the other Follower nodes to reset their election timeout timers.
The heartbeat timer starts when a node becomes the Leader and stops when the node reverts to a Follower. The heartbeat interval must be shorter than the election timeout, that is, Heartbeat Timeout (heartbeat broadcast interval) < Election Timeout (election timeout).
Election timeout timer
When the timer reaches the Election Timeout threshold, a Leader election is triggered: the current node increments its term number by 1 and tries to vote for itself (if it has not already given its vote to another candidate). If it votes for itself successfully, it becomes a candidate and sends vote requests to the other nodes.
The election timer's current count can be reset; it restarts whenever an AppendEntriesRPC request (including heartbeats) is received. Each node should use a different timeout threshold, so that nodes do not start vote requests at the same time and fail to elect a Leader over several rounds of elections.
Leader election process
The Leader is chosen through voting. In each term, every node has exactly one vote, and every node prefers to vote for itself; the node that obtains a majority of the votes becomes the Leader. For this reason a Raft cluster requires at least 3 nodes, and the total number of nodes is preferably odd.
RequestVoteRpc request packet (vote request packet):
public class RequestVote {
    private long term;
    private int candidateId;
    private long lastLogIndex;
    private long lastLogTerm;
}
term: the current term number of the requester (the candidate node);
candidateId: the node ID of the requester;
lastLogIndex: the index of the requester's latest log entry;
lastLogTerm: the term number of the requester's latest log entry.
RequestVoteRpc response packet (vote packet):
public class RequestVoteResp {
    private long term;
    private boolean voteGranted;
}
term: the current term number of the voter, used to tell the requester to update its own term value;
voteGranted: true if the voter gives its vote to the requester, otherwise false.
When the election timer expires, the candidate initiates a vote request as follows (a sketch of these steps follows the list):
1) Increment the locally maintained current term number (term) by 1.
2) Vote for itself and, on success, switch its state to candidate (Candidate); the first vote of every candidate node therefore comes from itself.
3) Send RequestVoteRPC requests (vote requests) to the other nodes in the cluster, asking them to vote for it.
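A minimal sketch of these three steps, assuming an all-args RequestVote constructor (not shown in the article's DTO) and the RaftRpcService interface sketched earlier; field names such as currentTerm and votedFor are illustrative:

import java.util.List;

// Sketch of a node starting an election when its election timer fires.
class ElectionStarter {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    long currentTerm;            // locally maintained term number
    Integer votedFor;            // node id voted for in the current term, null if none
    int nodeId;
    long lastLogIndex;
    long lastLogTerm;
    volatile Role role = Role.FOLLOWER;

    synchronized void onElectionTimeout(List<RaftRpcService> peers) {
        currentTerm++;                               // 1) increment the local term number
        votedFor = nodeId;                           // 2) vote for itself...
        role = Role.CANDIDATE;                       //    ...and become a candidate

        // Assumed all-args constructor for the RequestVote DTO shown above.
        RequestVote request = new RequestVote(currentTerm, nodeId, lastLogIndex, lastLogTerm);
        for (RaftRpcService peer : peers) {
            // 3) broadcast the vote request to the other nodes (asynchronous in practice);
            //    responses are tallied separately; see the result-handling sketch further below.
            peer.requestVote(request);
        }
    }
}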
When a node receives a vote request from a candidate node, it responds as follows, based on its own current term number, how up to date its log is, whether it has already cast its vote for another node (including itself), and so on (a sketch of this voting logic follows the list):
1) If the requester's term is less than the node's current term, return false to remind the requester that its term is out of date, and explicitly tell the requester that it will not receive the vote.
2) Otherwise, if the requester's term is greater than the node's current term and the node has not yet voted for anyone (including itself), vote for the requester and return the requester's term and true.
3) Otherwise, if the requester's term equals the node's current term, the vote has already been given to this requester (the repeated-request case), and the requester's log is at least as new as the node's own, return the requester's term and true.
4) Otherwise, if the vote has already been cast for someone else, the vote cannot be given to the requester, and the node explicitly informs the requester that it will not receive the vote.
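A hedged sketch of these rules on the receiving side. It assumes an all-args constructor on the RequestVoteResp DTO; currentTerm, votedFor, lastLogIndex and lastLogTerm are this node's local state, and the log-freshness check follows the rule described above (and in the Consistency section later):

// Sketch of a node deciding whether to grant its vote; names are illustrative assumptions.
class VoteHandler {
    long currentTerm;
    Integer votedFor;        // candidateId this node voted for in the current term, null if none
    long lastLogIndex;
    long lastLogTerm;

    synchronized RequestVoteResp handleRequestVote(long term, int candidateId,
                                                   long candidateLastLogIndex,
                                                   long candidateLastLogTerm) {
        // 1) the requester's term is stale: refuse explicitly
        if (term < currentTerm) {
            return new RequestVoteResp(currentTerm, false);
        }
        // 2) the requester's term is newer: adopt it; a new term frees up this node's vote
        if (term > currentTerm) {
            currentTerm = term;
            votedFor = null;
        }
        boolean logUpToDate = candidateLastLogTerm > lastLogTerm
                || (candidateLastLogTerm == lastLogTerm && candidateLastLogIndex >= lastLogIndex);
        // 3) grant (or re-grant on a repeated request) only if the vote is free
        //    and the requester's log is at least as new as this node's own
        if ((votedFor == null || votedFor == candidateId) && logUpToDate) {
            votedFor = candidateId;
            return new RequestVoteResp(currentTerm, true);
        }
        // 4) the vote already went to someone else, or the requester's log is behind
        return new RequestVoteResp(currentTerm, false);
    }
}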
After broadcasting the vote request, the candidate node reacts to the final voting result as follows (a sketch of tallying the responses follows the list):
1) If connections to a majority of the nodes fail, continue and re-launch the vote request in the same term; that is, with a majority of nodes down, the election cannot complete normally.
2) If it obtains votes from a majority of the nodes, it becomes the Leader; its own vote counts as one of them, but each node has only one vote, so a node that has voted for itself cannot also vote for another node.
3) If it discovers that another node has won the election (when the term in a vote response is greater than the current candidate's term, another node is considered to have won), it actively switches back to Follower.
4) If the election timeout timer fires again because no Leader heartbeat has been received and no node won the previous election, it initiates another election.
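A minimal sketch of tallying one round of responses, assuming the usual getters on the RequestVoteResp DTO and a hypothetical collection of responses (null for unreachable peers):

import java.util.List;

// Sketch of a candidate evaluating the outcome of one round of vote requests.
class ElectionResultHandler {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    long currentTerm;
    volatile Role role = Role.CANDIDATE;

    void onVotesCollected(List<RequestVoteResp> responses, int clusterSize) {
        int granted = 1;                             // 2) the candidate's own vote counts
        for (RequestVoteResp resp : responses) {
            if (resp == null) {
                continue;                            // 1) an unreachable node contributes no vote
            }
            if (resp.getTerm() > currentTerm) {
                currentTerm = resp.getTerm();
                role = Role.FOLLOWER;                // 3) another node has already won: step back
                return;
            }
            if (resp.isVoteGranted()) {
                granted++;
            }
        }
        if (granted > clusterSize / 2) {
            role = Role.LEADER;                      // majority reached: become the Leader
        }
        // 4) otherwise remain a candidate; the election timer will fire again and retry
    }
}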
If another node becomes the current Leader, that Leader informs this node by sending heartbeat packets. The Leader must be given enough time to deliver its heartbeats, so the election timeout must be greater than the heartbeat interval, that is, Heartbeat Timeout (heartbeat broadcast interval) < Election Timeout (election timeout).
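A minimal sketch of the two timers and the Heartbeat Timeout < Election Timeout relationship. The 100 ms heartbeat interval and the 300-500 ms randomized election timeout are illustrative values only; the article requires just that the heartbeat interval be smaller and that each node's timeout differ:

import java.util.Random;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the heartbeat timer and the randomized election timeout timer.
class RaftTimers {
    private static final long HEARTBEAT_INTERVAL_MS = 100;
    private static final long ELECTION_TIMEOUT_MIN_MS = 300;
    private static final long ELECTION_TIMEOUT_MAX_MS = 500;

    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);
    private final Random random = new Random();
    private ScheduledFuture<?> electionTask;
    private ScheduledFuture<?> heartbeatTask;

    // Called on Followers/Candidates; rescheduling on every AppendEntries (heartbeat) resets the timer.
    synchronized void resetElectionTimer(Runnable startElection) {
        if (electionTask != null) electionTask.cancel(false);
        long timeout = ELECTION_TIMEOUT_MIN_MS
                + random.nextInt((int) (ELECTION_TIMEOUT_MAX_MS - ELECTION_TIMEOUT_MIN_MS));
        electionTask = scheduler.schedule(startElection, timeout, TimeUnit.MILLISECONDS);
    }

    // Started when this node becomes Leader; stopped when it steps back to Follower.
    synchronized void startHeartbeatTimer(Runnable broadcastHeartbeat) {
        heartbeatTask = scheduler.scheduleAtFixedRate(
                broadcastHeartbeat, 0, HEARTBEAT_INTERVAL_MS, TimeUnit.MILLISECONDS);
    }

    synchronized void stopHeartbeatTimer() {
        if (heartbeatTask != null) heartbeatTask.cancel(false);
    }
}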
After the election, every Follower node must record which node is the current Leader, and the Leader node must record all of the other Follower nodes. The Leader needs to send heartbeats and log synchronization requests to the Follower nodes, while a Follower that receives a client request must tell the client to redirect the request to the Leader node.
Raft log replication process
In a Raft cluster, the Leader node is responsible for receiving the client's read and write requests; if a Follower receives a request, it must redirect the request to the Leader node.
If the Leader node receives a read request, it can query the data and respond to the client directly. If the Leader node receives a write request, it first translates the write request into an operation log entry, appends the entry to its local log, and then issues AppendEntriesRPC calls to the other nodes to replicate the entry to them. Once the entry has been successfully replicated to a majority of the nodes, the Leader commits it and applies it to the state machine, and then asynchronously issues AppendEntriesRPC calls informing the other Follower nodes that the entry has been committed. On receiving the commit request, a Follower first marks the entry as committed and then applies it to its state machine.
AppendEntriesRPC request packet (the Leader node sends this RPC to the other Follower nodes, asking them to replicate the log entries):
public class AppendEntries implements Cloneable {
    private long term;
    private int leaderId;
    private long prevLogIndex;
    private long prevLogTerm;
    private long leaderCommit;
    private CommandLog[] entries;
}
term: the term number of the Leader node when it created the log entry;
leaderId: the ID of the Leader node, so that the other Follower nodes can redirect client requests to the Leader node;
prevLogIndex: the index of the latest entry among the logs the Leader node has already committed;
prevLogTerm: the term number of the latest entry among the logs the Leader node has already committed;
leaderCommit: the Leader node maintains a leaderCommit for each Follower, indicating the index of the log entry that the Leader believes the Follower has committed;
entries: the log entries to be appended to the Follower; for a heartbeat, entries is empty.
AppendEntriesRPC response packet:
public class AppendEntriesResp {
    private long term;
    private boolean success;
}
term: the current term number, whose value is max(the term carried by the AppendEntries request, the Follower's locally maintained term). It is used by the Leader node to update its own term number: once a Leader finds a term number larger than its own, it knows it is an outdated Leader, stops sending heartbeats, and actively switches back to Follower.
success: true if the receiver (the Follower) holds an entry matching prevLogIndex and prevLogTerm, which indicates that the request succeeded.
The process by which the Leader node handles a client write request and replicates the corresponding log entry to the Followers is as follows (a sketch follows the list):
0) The client sends a write request to the Leader.
1) The Leader parses the write request into an operation log entry and appends it to its local log file.
2) The Leader asynchronously sends AppendEntriesRPC requests to the other Follower nodes.
3) The Leader blocks waiting for a majority of the nodes to respond successfully; a majority is at least (total number of nodes / 2) + 1. Since the Leader itself counts as one node, it only needs (total number of nodes / 2) other nodes to respond successfully.
4) If a majority of the nodes respond successfully, the Leader commits the log entry and applies it to its local state machine, asynchronously informs the other Follower nodes that the entry has been committed, and then immediately returns the result of the operation to the client.
5) Otherwise, it responds to the client with a failure.
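A hedged sketch of steps 0-5, assuming the RaftRpcService interface sketched earlier, an isSuccess() getter on AppendEntriesResp, and hypothetical helper methods (appendToLocalLog, replicateTo, applyToStateMachine, notifyCommitAsync) that a concrete implementation would supply:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch of the Leader handling a client write: append locally, replicate to Followers,
// and commit once a majority (including the Leader itself) holds the entry.
abstract class LeaderWriteHandler {
    protected long currentTerm;
    protected List<RaftRpcService> followers;   // the other cluster members

    public boolean handleClientWrite(String command) {
        long newIndex = appendToLocalLog(currentTerm, command);          // 1) append to the local log file

        List<CompletableFuture<AppendEntriesResp>> calls = followers.stream()
                .map(f -> CompletableFuture.supplyAsync(() -> replicateTo(f, newIndex)))
                .toList();                                               // 2) async AppendEntriesRPC to Followers

        int successes = 1;                                               // the Leader's own copy counts
        for (CompletableFuture<AppendEntriesResp> call : calls) {        // 3) block waiting for responses
            try {
                if (call.join().isSuccess()) successes++;
            } catch (RuntimeException ignored) {
                // an unreachable Follower simply does not count toward the majority
            }
        }

        int majority = (followers.size() + 1) / 2 + 1;                   // total / 2 + 1, Leader included
        if (successes >= majority) {
            applyToStateMachine(newIndex);                               // 4) commit and apply locally
            notifyCommitAsync(newIndex);                                 //    tell Followers the entry is committed
            return true;                                                 //    then answer the client
        }
        return false;                                                    // 5) otherwise report failure to the client
    }

    protected abstract long appendToLocalLog(long term, String command);
    protected abstract AppendEntriesResp replicateTo(RaftRpcService follower, long logIndex);
    protected abstract void applyToStateMachine(long logIndex);
    protected abstract void notifyCommitAsync(long logIndex);
}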
The process by which a Follower node handles a log replication request is as follows (a sketch follows the list):
0) Any AppendEntriesRPC request received (heartbeat request, commit request, or append request) resets the current count of the election timeout timer.
1) If the node's own term is greater than the term carried in the request, meaning the term of the Leader recorded in the request is behind its own, it returns its own term with success set to false (informing the requester that it is an outdated Leader).
2) Otherwise, if the term number of the Follower's own entry at prevLogIndex does not match the request parameter prevLogTerm, it returns its own term with success set to false (the current Follower node's log lags behind).
3) Otherwise, if the request is only a heartbeat packet, receiving the Leader's heartbeat means this node should be a Follower; if necessary it switches itself from candidate to Follower, returns its own term, and success is true.
4) Otherwise, the Follower checks the consistency of the log: it deletes entries that already exist locally but conflict with the request, appends any entries that are not yet in its log, removes redundant entries, and, if the copy succeeds, commits directly.
5) If the request parameter leaderCommit is greater than its current commitIndex, it updates commitIndex to max(leaderCommit, commitIndex), optimistically jumping the commit index of its local log forward to the value the Leader tracks for this Follower; this is used when a Follower has just recovered from a failure.
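A hedged sketch of these steps on the Follower side, assuming the usual getters on the AppendEntries DTO, an all-args AppendEntriesResp constructor, the article's CommandLog entry type, and placeholder helpers for log access:

// Sketch of a Follower processing an AppendEntries request (heartbeat, append, or commit).
abstract class FollowerAppendHandler {
    protected long currentTerm;
    protected long commitIndex;

    public AppendEntriesResp handleAppendEntries(AppendEntries req) {
        resetElectionTimer();                                     // 0) any AppendEntries resets the election timer

        if (req.getTerm() < currentTerm) {                        // 1) the sender is an outdated Leader
            return new AppendEntriesResp(currentTerm, false);
        }
        currentTerm = req.getTerm();

        if (!logMatches(req.getPrevLogIndex(), req.getPrevLogTerm())) {
            return new AppendEntriesResp(currentTerm, false);     // 2) this node's log lags / does not match
        }
        becomeFollowerIfNeeded();                                 // 3) heartbeat received: step back to Follower

        if (req.getEntries() != null && req.getEntries().length > 0) {
            truncateConflictsAndAppend(req.getEntries());         // 4) drop conflicting entries, append new ones
        }
        if (req.getLeaderCommit() > commitIndex) {                // 5) advance the local commit index
            commitIndex = Math.max(req.getLeaderCommit(), commitIndex);
            applyCommittedEntries(commitIndex);
        }
        return new AppendEntriesResp(currentTerm, true);
    }

    protected abstract void resetElectionTimer();
    protected abstract void becomeFollowerIfNeeded();
    protected abstract boolean logMatches(long prevLogIndex, long prevLogTerm);
    protected abstract void truncateConflictsAndAppend(CommandLog[] entries);
    protected abstract void applyCommittedEntries(long upToIndex);
}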
If a Follower node responds to the Leader that appending the log entry failed, and the Follower's current term number is less than or equal to the Leader's current term number, the Leader decrements the request parameter prevLogIndex and re-issues the AppendEntriesRPC request, repeating this until the AppendEntriesRPC succeeds. Success indicates that the Leader and the Follower agree on the log entry at position prevLogIndex. At that point, all log entries before position prevLogIndex on the Follower node are kept, all log entries after position prevLogIndex (which conflict with the Leader) are deleted by the Follower, and all of the Leader's log entries after position prevLogIndex are appended from that position. Therefore, once the AppendEntriesRPC succeeds, the logs of the Leader and the Follower are consistent (a sketch of this repair loop follows).
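A minimal sketch of this repair loop, assuming the RaftRpcService interface sketched earlier, the usual getters on AppendEntriesResp, and a hypothetical buildRequest helper that packs all entries after prevLogIndex:

// Sketch of the Leader backing up prevLogIndex until a Follower's log matches,
// then re-sending everything after that point.
abstract class LogRepairer {
    protected long currentTerm;

    public void repairFollower(RaftRpcService follower, long lastLogIndex) {
        long prevLogIndex = lastLogIndex;
        while (prevLogIndex >= 0) {
            AppendEntries req = buildRequest(prevLogIndex);          // entries = everything after prevLogIndex
            AppendEntriesResp resp = follower.appendEntries(req);
            if (resp.getTerm() > currentTerm) {
                return;                                              // this node is an outdated Leader; stop
            }
            if (resp.isSuccess()) {
                return;                                              // logs now agree at and after prevLogIndex
            }
            prevLogIndex--;                                          // back up one entry and try again
        }
    }

    protected abstract AppendEntries buildRequest(long prevLogIndex);
}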
Consistency
Because a candidate node must obtain a majority of the votes to become Leader, because a node will not vote for a candidate whose log is older than its own, and because the Leader only commits a log entry (marking it committed and applying it to the state machine) after it has successfully synchronized the entry to a majority of nodes (including itself), the Leader elected in each election is always a node that contains all committed log entries.
When a new Leader synchronizes new log entries to a Follower node whose log lags far behind, the Follower actively removes the entries that are not on the Leader and synchronizes the Leader's entries. For entries the Leader has already marked as committed, the Follower can apply them to its state machine as soon as it receives them, preserving the eventual consistency of the data.
Multi Raft
Suppose there are three machines, each deploying one Raft node service. Since read and write requests are handled only by the Leader node, does that mean only one machine is doing the work?
We can run multiple Raft services inside one node service (note: multiple services, not multiple processes) and build multiple Raft clusters, that is, Multi Raft, so that the Leader nodes of the different Raft clusters are evenly distributed across the machines. For example:
Machine 1: Raft service A node 1 (Leader), Raft service B node 1 (Follower), Raft service C node 1 (Follower)
Machine 2: Raft service A node 2 (Follower), Raft service B node 2 (Leader), Raft service C node 2 (Follower)
Machine 3: Raft service A node 3 (Follower), Raft service B node 3 (Follower), Raft service C node 3 (Leader)
The distributed database TiDB uses Multi Raft to shard its data, with each Raft cluster responsible for a separate portion of the data.
At this point, I believe you have a deeper understanding of how to implement the distributed consensus algorithm Raft. You might as well try it out in practice.