Analysis of how to implement SOFAJRaft and the principle of SOFAJRaft implementation 07/06 Update SLTechnology News&Howtos

Analysis of how to implement SOFAJRaft and the principle of SOFAJRaft implementation

2025-07-06 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Shulou(Shulou.com)06/02 Report--

This article is to share with you about the analysis of how to achieve SOFAJRaft and the principle of SOFAJRaft implementation, the editor thinks it is very practical, so I share it with you to learn. I hope you can get something after reading this article.

SOFAStack Scalable Open Financial Architecture Stack is a financial-level distributed architecture independently developed by Ant Financial Services Group, which includes all the components needed to build a financial-level cloud native architecture. It is a best practice honed in financial scenarios.

SOFAJRaft is a production-level high-performance Java implementation based on Raft consistency algorithm, which supports MULTI-RAFT-GROUP and is suitable for scenarios with high load and low latency.

SOFAJRaft: https://gitee.com/sofastack/sofa-jraft

Preface

Linear consistent read implements Java volatile semantics in a distributed system. When the client initiates a write request to the cluster and obtains a successful response, the result of the write operation is visible to all subsequent read requests. The conventional way to achieve linear consistent reading is to follow the Raft protocol, to process the read request according to Log, and to get the read result back to the client through log replication and state machine execution. SOFAJRaft uses ReadIndex instead of Raft state machine. This paper will analyze the principle of linear consistent reading around Raft Log Read,ReadIndex Read and Lease Read, and explain how SOFAJRaft uses ReadIndex and Lease Read to achieve linear consistent reading.

What is linear consistent reading? Consensus algorithm can only ensure that multiple nodes are consistent with the state of an object. Taking Raft as an example, it can only ensure that different nodes agree to Raft Log. What about the consistency of state machines behind Log?

How to achieve efficient linear consistent reading based on ReadIndex and Lease Read SOFAJRaft?

Linear consistent reading

What is linear consistent reading? The so-called linear consistent reading, a simple example is that we write a value at the moment of T1, then after T1, we must be able to read this value, it is impossible to read the old value before T1 (think of the volatile keyword in Java, that is, linear consistent reading is to implement Java volatile semantics in distributed systems). In short, the Java volatile semantic effect needs to be implemented in a distributed environment, that is, when the Client initiates a write request to the cluster and obtains a successful response, the result of the write operation will be visible to all subsequent read requests. The difference between volatile and volatile is that volatile is visible between implementation threads, while SOFAJRaft needs to implement visibility between Server.

Cdn.nlark.com/yuque/0/2019/png/156670/1561613080963-276c9d3e-a657-40f0-8ffb-d98fd510f6dc.png ">

As shown in the figure above, Client A, B, C and D are all linearly consistent, where D appears to be Stale Read, but in fact it is not. D request spans three stages, and Read may occur at any time, so reading to 1 or 2 is fine.

Raft Log read

The most common way to achieve linear consistent reading is to follow the Raft protocol, process the read request according to Log, get the read result through Log replication and state machine execution, and then return the read result to Client. Because Raft is originally an algorithm to achieve linear consistency in a distributed environment, it is very convenient to implement linear Read through Raft, that is, to Raft Log any read request, and then read the value from the state machine when the Log is submitted to apply, which must be able to ensure that the read value meets the linear requirements.

Of course, because every Read needs to go through the Raft process, Raft Log storage and replication bring disk flushing overhead, storage overhead, network overhead, Raft Log not only has the cost of log storage, but also the network overhead of log replication, and there is also a lot of disk occupation overhead caused by Raft "reading log", resulting in very inefficient Read operation performance, so it has a great impact on performance in scenarios with a lot of read operations. It is unacceptable in systems with a large proportion of reading and is usually not used.

In Raft, the node has three states: Leader,Candidate and Follower. Any write operation of Raft must go through Leader. Only Leader copies the corresponding Raft Log to the node of Majority, which is considered successful. So if the current Leader can determine that it must be Leader, then you can read the data directly on this Leader, because for Leader, if you confirm that a Log has been submitted to most nodes, and the apply writes to the state machine at T1, then the Read after T1 must be able to read the newly written data.

So how do you make sure that Leader must be Leader when dealing with this Read? In the Raft paper, two methods are mentioned:

ReadIndex Read

Lease Read

ReadIndex Read

The first is ReadIndex Read. When Leader needs to process Read requests, Leader exchanges heartbeat information with more than half of the machines to make sure that it is still Leader. It can provide linear consistent reading:

Leader records the commitIndex of his current Log in a Local variable ReadIndex

Then launch a round of Heartbeat to the Followers node. If more than half of the nodes return the corresponding Heartbeat Response, then the Leader can determine that it is still a Leader.

Leader waits for its own StateMachine state machine to execute, at least to the Log recorded by ReadIndex, until applyIndex exceeds ReadIndex, so that it can safely provide Linearizable Read, regardless of whether the Leader has drifted away.

Leader executes the Read request and returns the result to Client.

Using ReadIndex Read to provide the function of Follower Read, it is easy to provide linear consistent reading on Followers nodes. After Follower receives the Read request:

The Follower node requests the latest ReadIndex from Leader

Leader still goes through the previous process, performs the previous three steps (make sure he is really Leader), and returns ReadIndex to Follower

Follower waits for the current state machine to have more applyIndex than ReadIndex

Follower executes the Read request and returns the result to Client.

Instead of using Heartbeat through Raft Log's Read,ReadIndex Read to get Leader to confirm that it is Leader, omitting the Raft Log process. Compared with Raft Log, ReadIndex Read saves disk overhead and can greatly improve throughput. Although there is still network overhead, Heartbeat is inherently small, so the performance is still very good.

Lease Read

Although ReadIndex Read is much faster than the original Raft Log Read, there is still Heartbeat network overhead, so consider further optimization. Raft paper mentions a Lease Read optimization method through Clock + Heartbeat, that is, when Leader sends Heartbeat, first record a point in time Start, when most of the nodes in the system reply to Heartbeat Response, because of the election mechanism of Raft, Follower will not be re-elected until after the time of Election Timeout, and the time of the next Leader election is guaranteed to be greater than Start+Election Timeout/Clock Drift Bound, so it can be considered that the Lease validity of Leader can reach the time point of Start+Election Timeout/Clock Drift Bound. Lease Read is similar to ReadIndex but is further optimized to save not only Log but also network interaction, significantly increased read throughput and significantly reduced latency.

The basic idea of Lease Read is that Leader takes a lease term that is smaller than Election Timeout (preferably an order of magnitude smaller), and there will be no election during the lease period to ensure that the Leader will not change, so skipping the second step of ReadIndex will reduce the latency. Thus it can be seen that the correctness of Lease Read is linked to time and depends on the accuracy of the local clock, so although using Lease Read is very efficient, it still faces the risk problem, that is, there is a preset premise that the time of each server's CPU Clock is accurate, even if there is an error, it will be in a very small Bound range, the implementation of time is very important, if the clock drift is serious. The frequency of Clock is different from server to server, so there may be something wrong with this Lease mechanism.

The implementation of Lease Read includes:

Timed Heartbeat obtains majority response to confirm the effectiveness of Leader

During the validity period of the lease, it can be considered that the current Leader is the only valid Leader in the Raft Group, but the Heartbeat confirmation step in the ReadIndex can be ignored (2)

Leader waits for its own state machine to execute until applyIndex exceeds ReadIndex, so that Linearizable Read can be safely provided.

Implementation of linear consistent read for SOFAJRaft

SOFAJRaft adopts ReadIndex instead of Raft state machine. In short, it relies on the ReadIndex principle to read the results directly from Leader: all Log (which can be regarded as write operations) that have been copied to the majority are regarded as secure Log,Leader state machines. As long as the Log is sequentially executed to this Log, the data embodied in the Log can be visible to the client Client, which is divided into the following four steps:

Client initiates a Read request

Leader confirms the latest LogIndex copied to the majority

Leader confirms identity

Perform the Read operation after LogIndex apply.

With ReadIndex optimization, SOFAJRaft can reach 80% of the upper limit of RPC. In the above steps, we found that step 3 still requires Leader to confirm its Leader identity by sending a heartbeat to Followers, because the Leader identity in the Raft cluster can change at any time. So SOFAJRaft uses Lease Read to omit step 3 RPC. The lease is understood as the identity guarantee that the Raft cluster gives Leader a lease period of Lease, during which time Leader will not be deprived of his identity, so that when Leader receives the Read request, if it finds that the lease has not expired, it does not need to confirm its Leader identity by communicating with Followers, thus skipping the network communication overhead of step 3. Through Lease Read optimization, SOFAJRaft has almost reached the upper limit of RPC. However, maintaining the lease through the clock itself is not absolutely secure (clock drift issue), so the default configuration of SOFAJRaft is linear consistent read, because linear consistent read performance is usually good enough.

ReadIndex Read implementation

By default, the linear consistent read provided by SOFAJRaft is an ReadIndex implementation based on the Raft protocol. In the case of triple copies, the throughput of Leader reads is close to the upper limit of RPC throughput, and the delay depends on the slowest Heartbeat Response in the majority. Use Node#readIndex (byte [] requestContext, ReadIndexClosure done) to initiate a linear consistent read request. When reading safely, the incoming Closure will be called. Normally, the data read from the state machine will be returned to the client, and SOFAJRaft will ensure the linear consistency of the read. Linear consistent reads initiated by nodes in any cluster do not need to be forced to be placed on the Leader node and are allowed to be performed on the Follower node, thus greatly reducing the reading pressure on the Leader.

The ReadIndex linear consistent read implementation of SOFAJRaft based on Raft protocol is to call RaftServerService#handleReadIndexRequest API to process ReadIndex requests according to the current node status of STATE_LEADER,STATE_FOLLOWER and STATE_TRANSFERRING:

1. The current node status is STATE_LEADER, that is, the Leader node. When receiving a ReadIndex request, call the readLeader (request, ReadIndexResponse.newBuilder (), done) method to provide linear consistent reading:

Check the number of current Raft cluster nodes. If only one Peer node in the cluster directly obtains the ballot box BallotBox latest submission index lastCommittedIndex, that is, the commitIndex construction ReadIndexClosure response of the current Log of the Leader node.

The log manager LogManager checks whether the term of office is equal to the current term based on the lastCommittedIndex of the ballot box BallotBox. If it is not equal to the current term, it means that this Leader node has not submitted any logs during its term of office, and the read-only request needs to be rejected.

Verify that the number of Raft cluster nodes and the term of office of the lastCommittedIndex are as expected, then the response constructor sets its index to the lastCommittedIndex of the ballot box BallotBox, and the request from Follower needs to check whether the Follower is currently configured

Gets the ReadIndex request level ReadOnlyOption configuration. The default value of the ReadOnlyOption parameter is ReadOnlySafe,ReadOnlySafe to ensure the linearization of read-only requests by communicating with Quorum. Send the Heartbeat heartbeat request to the Followers node according to the ReadOnlyOption configuration for ReadOnlySafe to call the Replicator#sendHeartbeat (rid, closure) method, and successfully execute ReadIndexHeartbeatResponseClosure heartbeat response callback

The ReadIndex heartbeat response callback checks whether more than half of the nodes, including the Leader node itself, voted in favor, and more than half of the nodes returned a successful response to the client Heartbeat request, that is, if the applyIndex exceeds the ReadIndex, it means that the Log corresponding to the ReadIndex has been synchronized to provide Linearizable Read.

2. The current node status is STATE_FOLLOWER, that is, the Follower node. Receiving ReadIndex requests supports linear consistent reading through the readFollower (request, done) method:

Check whether the current Leader node is empty. If the Leader node is empty, of course there is no Leader node for the term of office.

The Follower node calls the RpcService#readIndex (leaderId.getEndpoint (), newRequest,-1, closure) method to send the ReadIndex request to the Leader, the Leader node calls the readIndex (requestContext, done) method to start the linearized read-only query request, and the read-only service adds the request to publish the ReadIndex event to the queue readIndexQueue, that is, the Ring Buffer of Disruptor

ReadIndex event handler ReadIndexEventHandler triggers the execution of ReadIndex events using executeReadIndexEvents (events) through MPSC Queue model accumulation and batch consumption. Polling ReadIndex events encapsulates ReadIndexState status lists to build ReadIndexResponseClosure response callbacks and submit them to Leader nodes to process ReadIndex requests.

The Leader node calls the handleReadIndexRequest (request, readIndexResponseClosure) method to perform the readLeader linear consistent reading process, and returns the lastCommittedIndex of the ballot box BallotBox. In response to the callback, ReadIndex traverses the status list to record the current submission log Index, checking whether the committedIndex of the latest Log Entry of the application state machine has been applied, that is, comparing whether the state machine appliedIndex is greater than or equal to the current committedIndex. Since the Leader node processes the add LogEntry request and sends the heartbeat to the ballot box BallotBox to update the lastCommittedIndex, when the lastCommittedIndex of the Leader node is greater than the current lastCommittedIndex, an asynchronous task of submitting LogEntry is created and published to the taskQueue queue, and the application task processor ApplyTaskHandler executes the task of submitting the LogEntry application, notifying the Follower node that the newly applied committedIndex has been updated. If the applyIndex of the current application state machine exceeds ReadIndex, notify ReadIndex that the request was successfully returned to the client. When the current Follower node lags behind the Leader, put the committedIndex returned by the Leader node into the pendingNotifyStatus cache and wait for the Leader node to synchronize the log update applyIndex.

SOFAJRaft is based on the ReadIndex core logic of Batch+Pipeline Ack+ fully asynchronous mechanism:

Lease Read implementation

SOFAJRaft adopts Lease Read optimization of Clock+Heartbeat to ensure CPU clock synchronization of machines in the cluster for higher performance scenarios. The server sets the ReadOnlyOption parameter of RaftOptions to ReadOnlyLeaseBased. ReadOnlyLeaseBased ensures the linearization of read-only requests by relying on Leader lease, which may be affected by clock drift. If the clock drift is unlimited, the Leader node may keep the lease longer than it should (the clock can be moved backward / paused without any limit), in which case ReadIndex is not secure.

The implementation of SOFAJRaft based on Lease Read linear consistent reading is that the Leader node calls the handleReadIndexRequest API to receive the ReadIndex request to obtain the ReadIndex request level ReadOnlyOption configuration. When the ReadOnlyOption is configured as ReadOnlyLeaseBased, confirm whether the Leader lease is valid, that is, check whether the Heartbeat interval is less than the election timeout time, and the Leader lease timeout needs to be changed to ReadIndex mode. During the validity period of the Leader lease, it is considered that the current Leader is the only valid Leader in the Raft Group. Ignoring the step of sending Heartbeat to confirm the identity of the ReadIndex, it directly returns the Follower node and the local node to respond to the Read request successfully. The Leader node continues to wait for the state machine to execute until applyIndex exceeds the ReadIndex security to provide Linearizable Read.

SOFAJRaft implements linear consistent read Lease Read optimization logic based on clock and heartbeat:

This paper analyzes the basic principles of Raft Log Read,ReadIndex Read and Lease Read linear consistent reading based on the implementation details of SOFAJRaft linear consistent reading, and expounds how SOFAJRaft uses Batch+Pipeline Ack+ full asynchronous mechanism and Clock+Heartbeat means to optimize the concrete implementation of ReadIndex and Lease Read linear consistent reading.

The above is how to analyze the implementation of SOFAJRaft and the principle of SOFAJRaft implementation. The editor believes that there are some knowledge points that we may see or use in our daily work. I hope you can learn more from this article. For more details, please follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.