SOFAJRaft-RheaKV MULTI-RAFT-GROUP Implementation Analysis
SOFAStack (Scalable Open Financial Architecture Stack) is a financial-grade distributed architecture independently developed by Ant Financial Services Group. It includes all the components needed to build a financial-grade cloud-native architecture and is a set of best practices honed in financial scenarios.
Preface
RheaKV is the first distributed embedded key-value database built on JRaft. This article analyzes, through source code and examples, how RheaKV uses MULTI-RAFT-GROUP to achieve high performance and capacity scalability.
MULTI-RAFT-GROUP
From the description of the Raft protocol we know that every update to a Raft group must go through the Leader, which then replicates it to a majority of Followers. In practice, the Leader of a single Raft group easily becomes a single-point traffic bottleneck and cannot carry high write traffic; moreover, every node holds the full data set, so capacity is limited by the storage of a single node and cannot be scaled out.
MULTI-RAFT-GROUP solves the disk bottleneck by splitting the whole data set horizontally into multiple Regions, each of which forms an independent Raft group with its own Leader and one or more Followers. The system then has multiple write nodes that share the write pressure, as shown below:
(Figure: the key space split into multiple Regions, each an independent Raft group with its own Leader and Followers)
With the disk bottleneck and the per-Raft-group I/O bottleneck addressed, how do the multiple groups work together? Let's read on.
Election and replication
RheaKV involves three main roles: PlacementDriver (PD below), Store, and Region. Because RheaKV supports multiple Raft groups, it has one more role than the single-group case, the PD, which schedules the cluster and collects basic information about each Store and Region.
PlacementDriver
PD is responsible for managing and scheduling the whole cluster, generating Region IDs, and so on. The component is optional: if you do not use a PD, set the fake property of PlacementDriverOptions to true. PD usually schedules a Region based on the information returned in the Region's heartbeat; after the Region applies the change, PD receives the updated Region information in the next heartbeat and refreshes its routing and status tables.
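As a minimal sketch of that configuration (assuming the PlacementDriverOptionsConfigured builder from the RheaKV options package; treat the exact builder methods as assumptions rather than verified API):

import com.alipay.sofa.jraft.rhea.options.PlacementDriverOptions;
import com.alipay.sofa.jraft.rhea.options.configured.PlacementDriverOptionsConfigured;

public final class FakePdOptionsSketch {
    public static PlacementDriverOptions fakePdOptions() {
        // fake = true: run without a real PlacementDriver; no scheduling, static routing only.
        return PlacementDriverOptionsConfigured.newConfigured()
                .withFake(true)
                .config();
    }
}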
Store
Usually one Node hosts one Store. A Store can be seen as a container of Regions, holding multiple data shards. The Store actively reports a StoreHeartbeatRequest heartbeat to PD, handled by PD's handleStoreHeartbeat; it carries basic information about the Store, such as how many Regions it contains and which Regions have their Leader on this Store.
Region
Region is the smallest unit of data storage and relocation, corresponding to an actual key range in a Store. Each Region has multiple replicas, each stored on a different Store, and together they form a Raft group. The Leader of a Region actively reports a RegionHeartbeatRequest heartbeat to PD, handled by PD's handleRegionHeartbeat; PD detects whether the Region has changed through the Region's Epoch.
RegionRouteTable routing table component
The multiple Regions of a Multi-Raft-Group are managed by the RegionRouteTable routing component, which can add, update, and remove Regions via addOrUpdateRegion and removeRegion, including handling Region splits. Region merging has not been implemented yet and will be considered later.
Partitioning logic and algorithm (Shard)
"Let each group of Raft be responsible for some of the data."
The usual partitioning (sharding) algorithms are Range and Hash; RheaKV shards data by Range, splitting it into Raft groups, i.e. Regions. Why Range? Because Range partitioning splits by the byte order of keys into contiguous segments and then handles each segment separately. Operations like scan can then concentrate queries for adjacent keys within one Region as far as possible, which Hash cannot support. Range also handles a single Region split more gracefully: only some metadata needs to change, and no large-scale data movement is involved.
Of course, Range has its own problem: a Region that is operated on frequently can become a hot Region. There are optimizations for this, such as PD scheduling hot Regions onto more idle machines, or letting Followers share the read load.
The structure of Region and RegionEpoch is as follows:
class Region {
    long id;                 // region id
    // Region key range [startKey, endKey)
    byte[] startKey;         // inclusive
    byte[] endKey;           // exclusive
    RegionEpoch regionEpoch; // region term
    List<Peer> peers;        // all peers in the region
}

class RegionEpoch {
    // Conf change version, auto increment when add or remove peer
    long confVer;
    // Region version, auto increment when split or merge
    long version;
}

class Peer {
    long id;
    long storeId;
    Endpoint endpoint;
}
Region.id: the unique identifier of the Region, allocated globally and uniquely by PD.
Region.startKey, Region.endKey: the key range of the Region, [startKey, endKey). Note that the startKey of the first Region and the endKey of the last Region are empty.
Region.regionEpoch: when the Region adds or removes a Peer, or splits, the regionEpoch changes; confVer is incremented after a configuration change, and version is incremented on every split or merge (merge is not yet implemented).
Region.peers: the nodes contained in the current Region. Peer.id is also allocated globally by PD, and Peer.storeId indicates which Store the Peer currently resides on.
Read / Write
Because data is split across different Regions, reading, writing, or updating multiple keys may touch multiple Regions. In that case we first have to determine the specific Regions and then operate on each of them separately. Let's take a scan over multiple Regions as an example, with the goal of returning all data in a given key interval.
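Before looking at the internals, this is roughly what such a call looks like from the client side (a minimal usage sketch; the scan signature and the KVEntry accessors are assumed from the RheaKVStore client interface):

import java.util.List;
import java.util.concurrent.CompletableFuture;
import com.alipay.sofa.jraft.rhea.client.RheaKVStore;
import com.alipay.sofa.jraft.rhea.storage.KVEntry;

public final class ScanUsageSketch {
    // Returns all entries in [startKey, endKey), which may span several Regions internally.
    static CompletableFuture<List<KVEntry>> scanRange(RheaKVStore store, byte[] startKey, byte[] endKey) {
        return store.scan(startKey, endKey)
                .whenComplete((entries, err) -> {
                    if (err == null) {
                        entries.forEach(e ->
                                System.out.println(new String(e.getKey()) + " = " + new String(e.getValue())));
                    }
                });
    }
}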
Let's first look at the asynchronous implementation of internalScan, the core method behind scan:
For example: com.alipay.sofa.jraft.rhea.client.DefaultRheaKVStore#scan(byte[], byte[], boolean, boolean)
It is easy to see that when scan is called, the PD client first retrieves the Regions covered by startKey and endKey via RegionRouteTable.findRegionsByKeyRange, returning a list of Regions. The lookup relies on the following variables in RegionRouteTable:
We can see that the range routing table of the whole RheaKV is stored in a TreeMap, which matches the fact that all keys are sorted by their byte order. The corresponding value is the Region's regionId, through which the Region itself can then be found in the regionTable routing map.
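A sketch of that idea, using the simplified Region class listed earlier (field and method names here are illustrative, not copied from RegionRouteTable): keys are ordered byte-wise, each entry maps a Region's startKey to its regionId, and the regionId resolves to the Region itself.

import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

public final class RangeRoutingSketch {
    // Unsigned byte-wise ordering, matching how keys are sorted for Range partitioning.
    static final Comparator<byte[]> BYTE_WISE = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (c != 0) return c;
        }
        return a.length - b.length;
    };

    final TreeMap<byte[], Long> rangeTable = new TreeMap<>(BYTE_WISE); // startKey -> regionId
    final Map<Long, Region> regionTable = new ConcurrentHashMap<>();   // regionId -> Region

    // The Region covering a key is the one whose startKey is the greatest key <= the lookup key.
    Region findRegionByKey(byte[] key) {
        Map.Entry<byte[], Long> e = rangeTable.floorEntry(key);
        return e == null ? null : regionTable.get(e.getValue());
    }
}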
Now that we have the List of all Regions covered by the scan, note the "retryCause -> {}" lambda inside the loop over Regions: it is easy to see that this is retry support on failure, which we will come back to later. The results of each Region are then queried through internalRegionScan.
There is another retry here as well. The code decides whether to query locally or via RPC depending on whether the current node hosts the Region: if it is local, rawKVStore.scan() is called for a direct local query; otherwise the query goes to the remote node through rheaKVRpcService. Each Region's query returns a future, and the results are joined concurrently and asynchronously into a list via the FutureHelper.joinList utility, which builds on CompletableFuture.allOf.
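A generic illustration of that fan-out/fan-in (not the FutureHelper source): one future per Region, joined with CompletableFuture.allOf and flattened into a single list.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public final class JoinListSketch {
    static <T> CompletableFuture<List<T>> joinList(List<CompletableFuture<List<T>>> perRegionFutures) {
        return CompletableFuture
                .allOf(perRegionFutures.toArray(new CompletableFuture[0]))
                .thenApply(ignored -> {
                    List<T> all = new ArrayList<>();
                    // Every future has already completed once allOf completes, so join() does not block.
                    perRegionFutures.forEach(f -> all.addAll(f.join()));
                    return all;
                });
    }
}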
Now let's look at the write path. Compared with the scan read, a put write is relatively simple: we only need to compute the Region corresponding to the key and then store the value there. Let's look at an example of an asynchronous put.
For example: com.alipay.sofa.jraft.rhea.client.DefaultRheaKVStore#put(java.lang.String, byte[])
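For orientation, a minimal usage sketch (the blocking bPut variant alongside the asynchronous put is assumed from the client interface; return types are assumptions):

import java.util.concurrent.CompletableFuture;
import com.alipay.sofa.jraft.rhea.client.RheaKVStore;

public final class PutUsageSketch {
    static void writeExamples(RheaKVStore store) {
        CompletableFuture<Boolean> async = store.put("user/1001", "alice".getBytes()); // asynchronous put
        Boolean ok = store.bPut("user/1002", "bob".getBytes());                        // blocking variant
        System.out.println("async done: " + async.join() + ", sync done: " + ok);
    }
}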
We can see that the underlying put method supports batching and can submit writes in batches; if batching is not used, the write is submitted directly. The logic is roughly this:
The corresponding storage Region is looked up through pdClient, the RegionEngine is obtained by regionId, and the put is then executed against the corresponding KVStore storage engine; the whole process also supports the retry mechanism. Going back to the batch implementation, it is easy to see that it uses Disruptor's RingBuffer: the lock-free queue is what provides the performance guarantee.
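The following is a generic Disruptor sketch of that batching pattern, not the DefaultRheaKVStore source: writes are published to a lock-free ring buffer, and a single consumer uses the endOfBatch flag to decide when to flush the accumulated batch.

import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;

public final class PutBatchingSketch {
    static final class KVEvent {
        byte[] key;
        byte[] value;
    }

    public static void main(String[] args) {
        Disruptor<KVEvent> disruptor =
                new Disruptor<>(KVEvent::new, 1024, r -> new Thread(r, "kv-batch-consumer"));

        disruptor.handleEventsWith((EventHandler<KVEvent>) (event, sequence, endOfBatch) -> {
            // Accumulate event.key / event.value into an in-memory batch here...
            if (endOfBatch) {
                // ...and submit the whole batch to the store in one call.
            }
        });

        RingBuffer<KVEvent> ringBuffer = disruptor.start();
        ringBuffer.publishEvent((event, seq) -> {
            event.key = "k1".getBytes();
            event.value = "v1".getBytes();
        });

        disruptor.shutdown();
    }
}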
Split / Merge
When is a Region split?
As mentioned earlier, PD schedules Regions based on the Region heartbeat. When the number of keys in a Region exceeds a preset threshold, the Region can be split: the Store's state machine KVStoreStateMachine receives the split message and performs the split. The relevant code paths are KVStoreStateMachine.doSplit and StoreEngine.doSplit.
We can easily see that the original parentRegion is split into region and pRegion: the startKey, endKey, and version numbers are reset, both are added to RegionEngineTable and registered with RegionKVService, and pdClient.getRegionRouteTable().splitRegion() is called to update the Region routing table stored in PD.
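Using the simplified classes listed earlier, the bookkeeping can be sketched like this (an illustration, not the actual StoreEngine.doSplit source): the parent [startKey, endKey) shrinks to [startKey, splitKey), a new Region takes over [splitKey, endKey), and both versions are bumped.

public final class SplitSketch {
    // splitKey and the new region id (allocated by PD) are inputs decided elsewhere.
    static Region split(Region parent, byte[] splitKey, long newRegionId) {
        Region child = new Region();
        child.id = newRegionId;                 // id is globally allocated by PD
        child.startKey = splitKey;              // child covers [splitKey, endKey)
        child.endKey = parent.endKey;
        child.peers = parent.peers;             // replicas initially stay on the same stores
        child.regionEpoch = new RegionEpoch();
        child.regionEpoch.version = parent.regionEpoch.version + 1;

        parent.endKey = splitKey;               // parent now covers [startKey, splitKey)
        parent.regionEpoch.version += 1;        // version increments on split
        return child;
    }
}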
When do Regions need to be merged?
Just as too much data calls for a split, when two or more adjacent Regions hold significantly less data than the vast majority of Regions, they can be merged. This part will be implemented and considered later.
Analysis of the RegionKVService structure and implementation
StoreEngine
As noted above, a Store is a node and contains one or more RegionEngines. A StoreEngine usually talks to PD through PlacementDriverClient and has a StoreEngineOptions configuration, which holds the storage-engine and node-related settings.
Take the default DefaultRheaKVStore loading a StoreEngine as an example. DefaultRheaKVStore implements the basic functions of the RheaKVStore interface. Starting from the init method, it loads the pdClient instance according to RheaKVStoreOptions and then loads the storeEngine.
When the StoreEngine starts, it first loads the corresponding StoreEngineOptions configuration, builds the corresponding Store configuration, and creates the consistent-read thread pool readIndexExecutor, the snapshot thread pool snapshotExecutor, the RPC thread pool cliRpcExecutor, the Raft RPC thread pool raftRpcExecutor, the storage RPC thread pool kvRpcExecutor, the heartbeat sender HeartbeatSender, and so on. Looking at the code, there is also a metricsReportPeriod option which, when enabled, can be used to monitor performance metrics.
Once DefaultRheaKVStore has finished loading, you can use get, put, scan and other operations, in both synchronous and asynchronous variants.
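A minimal client-side sketch, assuming an already populated RheaKVStoreOptions (cluster name, PD options, initial server list, and on server nodes the StoreEngineOptions) produced by a hypothetical buildOptions() helper; init/shutdown follow the Lifecycle contract and the blocking b* methods are assumed from the client interface:

import com.alipay.sofa.jraft.rhea.client.DefaultRheaKVStore;
import com.alipay.sofa.jraft.rhea.client.RheaKVStore;
import com.alipay.sofa.jraft.rhea.options.RheaKVStoreOptions;

public final class RheaKVClientSketch {
    public static void main(String[] args) {
        RheaKVStoreOptions opts = buildOptions(); // hypothetical helper, see note above
        RheaKVStore store = new DefaultRheaKVStore();
        if (store.init(opts)) {
            store.bPut("greeting", "hello".getBytes()); // blocking put
            byte[] value = store.bGet("greeting");      // blocking get
            System.out.println(new String(value));
        }
        store.shutdown();
    }

    static RheaKVStoreOptions buildOptions() {
        // Placeholder: real options need the PD, routing and (for servers) store-engine configuration.
        return new RheaKVStoreOptions();
    }
}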
During this process, the StoreEngine records regionKVServiceTable and regionEngineTable, which respectively manage the operations and the storage of each individual Region; the key of both maps is the RegionId.
RegionEngine
A RegionEngine is the execution unit for each Region replica within a Store. It records the associated StoreEngine information and the corresponding Region information. Since it is also an election node, it contains the corresponding state machine KVStoreStateMachine and the corresponding RaftGroupService, in which an RpcServer is started for election and synchronization.
It also provides a transferLeadershipTo method, which can be called to rebalance the Leaders of the current node's partitions and avoid concentrated pressure.
DefaultRegionKVService is the default implementation class for RegionKVService and mainly deals with specific operations on Region.
FailoverClosure in RheaKV
FailoverClosure deserves special mention: in concrete RheaKV operations it plays an important role and adds a degree of fault tolerance to the whole system. In a scan operation, for example, when data has to be scanned from multiple nodes across Stores, any network jitter can lead to incomplete data or outright failure, so allowing a certain number of retries improves the availability of the system. The retry count should not be too high, though: under network congestion, multiple timeout-level failures put extra pressure on the system. You only need to configure the failoverRetries setting used by DefaultRheaKVStore.
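A generic illustration of the retry behaviour described above (not the FailoverClosure source): an asynchronous call is retried a bounded number of times, and each retry is an opportunity to refresh stale routes first.

import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

public final class FailoverRetrySketch {
    static <T> CompletableFuture<T> callWithFailover(Supplier<CompletableFuture<T>> call, int retriesLeft) {
        CompletableFuture<T> result = new CompletableFuture<>();
        call.get().whenComplete((value, err) -> {
            if (err == null) {
                result.complete(value);
            } else if (retriesLeft > 0) {
                // A retry is also the place where a stale route table could be refreshed.
                callWithFailover(call, retriesLeft - 1).whenComplete((v, e) -> {
                    if (e == null) result.complete(v);
                    else result.completeExceptionally(e);
                });
            } else {
                result.completeExceptionally(err); // retries exhausted
            }
        });
        return result;
    }
}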
PlacementDriverClient of RheaKV PD
The PlacementDriverClient interface is mainly implemented by AbstractPlacementDriverClient, with FakePlacementDriverClient and RemotePlacementDriverClient as the two concrete implementations. FakePlacementDriverClient mocks a PD object when the system does not need a PD; here we mainly discuss RemotePlacementDriverClient.
RemotePlacementDriverClient is loaded through PlacementDriverOptions and refreshes the routing table according to its basic configuration.
RemotePlacementDriverClient is responsible for maintaining the RegionRouteTable routing table and for obtaining Store, routing, and Leader-node information, among other things.
RemotePlacementDriverClient also contains a CliService, through which replica nodes can be operated and maintained externally, for example addReplica, removeReplica, and transferLeader.
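These maintenance operations ultimately map onto JRaft's CliService. As a rough sketch using the core CliService directly (the group id and addresses below are made up, and the factory call is an assumption based on the JRaft user guide):

import com.alipay.sofa.jraft.CliService;
import com.alipay.sofa.jraft.RaftServiceFactory;
import com.alipay.sofa.jraft.conf.Configuration;
import com.alipay.sofa.jraft.entity.PeerId;
import com.alipay.sofa.jraft.option.CliOptions;

public final class ReplicaMaintenanceSketch {
    public static void main(String[] args) {
        CliService cliService = RaftServiceFactory.createAndInitCliService(new CliOptions());

        Configuration conf = new Configuration();
        conf.parse("127.0.0.1:8181,127.0.0.1:8182,127.0.0.1:8183"); // current group members

        String groupId = "rheakv_region_1"; // made-up raft group id for one Region
        cliService.addPeer(groupId, conf, PeerId.parsePeer("127.0.0.1:8184"));        // add a replica
        cliService.transferLeader(groupId, conf, PeerId.parsePeer("127.0.0.1:8182")); // move the leader
        cliService.removePeer(groupId, conf, PeerId.parsePeer("127.0.0.1:8181"));     // remove a replica
    }
}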
Summary
Because many traditional storage middleware systems were never distributed in the first place, I had little hands-on feel for this area. The Raft protocol is an easy-to-understand consensus protocol, SOFAJRaft is an excellent code and engineering example of it, and RheaKV is a very lightweight embedded database supporting multiple storage structures. Writing a code-analysis article is itself a process of learning and progress, through which we can also glimpse some basic database implementations. I hope the community can build more flexible and autonomous systems and applications on top of SOFAJRaft / RheaKV.