How to analyze the principle of SOFAJRaft implementation

2025-07-11 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article analyzes the implementation principles of SOFAJRaft, with a focus on its storage module.

The implementation details of the SOFAJRaft storage module are described from three aspects: log storage (LogStorage), meta-information storage (RaftMetaStorage), and snapshot storage (SnapshotStorage). Together they cover how logs, Raft configuration, and snapshot images are stored and exchanged between SOFAJRaft server nodes (Node).

SOFAStack

Scalable Open Financial Architecture Stack

SOFAStack is a financial-grade distributed architecture independently developed by Ant Financial Services Group. It includes the components needed to build a financial-grade cloud-native architecture and reflects practices honed in financial scenarios.

SOFAJRaft is a production-grade, high-performance Java implementation of the Raft consensus algorithm. It supports MULTI-RAFT-GROUP and is suitable for high-load, low-latency scenarios.

Preface


The SOFAJRaft storage module is divided into:

Log storage, which records Raft configuration changes and user-submitted tasks as logs.

Meta storage (meta-information storage), which records the internal state of the Raft implementation.

Snapshot storage, which stores the user's state machine snapshot and its meta information.

This article analyzes the SOFAJRaft storage module from the angles of log storage, meta-information storage, and snapshot storage, and explains how the storage problems of the Raft protocol are solved and how the storage module is implemented. It addresses three questions:

How are the logs of Raft configuration changes and user-submitted tasks stored, and how are calls to log storage managed?

How does a SOFAJRaft server node (Node) store Raft's internal state?

How is the Raft state machine snapshot (Snapshot) mechanism implemented, and how are snapshots stored and installed?


Log storage

Log storage records Raft configuration changes and user-submitted tasks as log entries, which are replicated from the Leader to the other nodes.

LogStorage is the log storage interface. The default implementation is based on RocksDB, and custom log storage can be provided by implementing the LogStorage interface.

LogManager is responsible for invoking the underlying LogStorage and adds caching, batch submission, necessary checks, and optimizations around log storage calls.

LogStorage storage implementation

LogStorage is the log storage interface. It defines the core API of the log storage module for a Raft group node (Node), including:

Return the first / last log index in the log

Get a Log Entry and its term by log index

Append a single Log Entry or a batch of Log Entries to log storage

Delete logs from the head / tail of log storage

Delete all existing logs and reset the next log index.
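The operations above can be sketched with an in-memory stand-in. This is a minimal illustration, not the SOFAJRaft code: the method names follow the operations listed above, while SketchEntry, InMemoryLogStorage, and the convention that an empty log starts at index 1 are assumptions made for the example.

```java
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified log entry; a real log entry carries more fields.
final class SketchEntry {
    final long index;
    final long term;
    final String data;

    SketchEntry(long index, long term, String data) {
        this.index = index;
        this.term = term;
        this.data = data;
    }
}

// In-memory stand-in for a log store, keyed by log index.
final class InMemoryLogStorage {
    private final TreeMap<Long, SketchEntry> log = new TreeMap<>();
    private long firstIndex = 1; // assumed start index of an empty log

    long getFirstLogIndex() {
        return log.isEmpty() ? firstIndex : log.firstKey();
    }

    long getLastLogIndex() {
        return log.isEmpty() ? firstIndex - 1 : log.lastKey();
    }

    SketchEntry getEntry(long index) {
        return log.get(index);
    }

    long getTerm(long index) {
        SketchEntry e = log.get(index);
        return e == null ? 0 : e.term;
    }

    boolean appendEntry(SketchEntry entry) {
        log.put(entry.index, entry);
        return true;
    }

    // Returns the number of entries written.
    int appendEntries(List<SketchEntry> entries) {
        for (SketchEntry e : entries) {
            log.put(e.index, e);
        }
        return entries.size();
    }

    // Deletes every entry whose index is smaller than firstIndexKept.
    boolean truncatePrefix(long firstIndexKept) {
        log.headMap(firstIndexKept, false).clear();
        firstIndex = Math.max(firstIndex, firstIndexKept);
        return true;
    }

    // Deletes every entry whose index is larger than lastIndexKept.
    boolean truncateSuffix(long lastIndexKept) {
        log.tailMap(lastIndexKept, false).clear();
        return true;
    }

    // Drops all entries; the next appended entry is expected at nextLogIndex.
    boolean reset(long nextLogIndex) {
        log.clear();
        firstIndex = nextLogIndex;
        return true;
    }
}
```

A sorted map keeps the first and last index lookups cheap, which is the same property the RocksDB-based implementation gets from its ordered keys.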

Log Index: tasks submitted to the Raft Group are serialized into log storage. Each log entry has an index that increases monotonically across the whole Raft Group, and the entries are replicated to every Raft node. The LogStorage API is defined in:

com.alipay.sofa.jraft.storage.LogStorage

RocksDBLogStorage is implemented based on RocksDB

Log Structured Merge Tree (LSM) divides one big tree into N small trees. Data is first written to memory, where an ordered small tree is built. As the in-memory tree grows, it is flushed to disk, and the trees on disk are merged periodically into a larger tree to optimize read performance. Converting random writes into sequential disk writes improves write performance. RocksDB is an embedded key-value storage engine written in C++ on top of the LSM-Tree data structure, and its keys and values are arbitrary byte streams. RocksDB keeps all data sorted by key, and its common operations include get(key), put(key), delete(key), and newIterator(). RocksDB has three basic structures: memtable, sstfile, and logfile. The memtable is an in-memory data structure: every write request goes into the memtable and, optionally, into the logfile. The logfile is a sequentially written file on storage. When the memtable fills up, it is flushed to an sstfile on storage, and the associated logfile can then be safely deleted. The data in an sstfile is sorted so that keys can be looked up quickly.

The default LogStorage implementation, RocksDBLogStorage, stores logs in RocksDB. When log storage is initialized, StorageFactory creates a RocksDBLogStorage by default from the Raft node's log storage path and the Raft internal option that controls whether fsync is called. The core operations of the RocksDB-based RocksDBLogStorage include:
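The write path described above can be illustrated with a toy LSM structure. This is a sketch of the idea only and has nothing to do with RocksDB internals: ToyLsm and its methods are hypothetical, sorted maps stand in for sstfiles, and deletes (tombstones) and the logfile are left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.TreeMap;

final class ToyLsm {
    private final int memtableLimit;
    private TreeMap<String, String> memtable = new TreeMap<>();
    // Newest run first; each sorted run stands in for one sstfile.
    private final Deque<TreeMap<String, String>> runs = new ArrayDeque<>();

    ToyLsm(int memtableLimit) {
        this.memtableLimit = memtableLimit;
    }

    // Writes always land in the memtable; a full memtable is flushed.
    void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= memtableLimit) {
            flush();
        }
    }

    // Reads check the memtable first, then the runs from newest to oldest.
    String get(String key) {
        String value = memtable.get(key);
        if (value != null) {
            return value;
        }
        for (TreeMap<String, String> run : runs) {
            value = run.get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    // Turns the current memtable into the newest sorted run.
    void flush() {
        if (memtable.isEmpty()) {
            return;
        }
        runs.addFirst(memtable);
        memtable = new TreeMap<>();
    }

    // Merges all runs into one; applying oldest first lets newer values win.
    void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        Iterator<TreeMap<String, String>> oldestFirst = runs.descendingIterator();
        while (oldestFirst.hasNext()) {
            merged.putAll(oldestFirst.next());
        }
        runs.clear();
        if (!merged.isEmpty()) {
            runs.addFirst(merged);
        }
    }

    int runCount() {
        return runs.size();
    }
}
```

Each flush produces a new sorted run without touching older ones, which is why writes stay sequential; the merge step is what keeps the number of runs a read has to check small.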

init(): creates the RocksDB configuration options and calls RocksDB#open() to build the RocksDB instance, adds the default column family and its options to obtain the column family handles, and uses newIterator() to create a RocksDB iterator that traverses the KeyValue data, checks the Value type, and loads Raft configuration changes into the configuration manager ConfigurationManager. RocksDB introduces the concept of a column family (ColumnFamily): a data set made up of a series of KeyValue pairs. RocksDB reads and writes must specify a column family, and creating a RocksDB instance builds a column family named default.

shutdown(): first closes the column family handles and the RocksDB instance, then iterates over the column family options and closes each one, then closes the RocksDB options, and finally clears the strong references so that the RocksDB instance and its option objects can be garbage collected.

getFirstLogIndex(): builds a RocksIterator from the handle defaultHandle and the read options totalOrderReadOptions, and checks whether the first log index has already been loaded. If not, it calls seekToFirst() to obtain the first log index stored in RocksDB and caches it.

getLastLogIndex(): builds a RocksIterator from the handle defaultHandle and the read options totalOrderReadOptions, and calls seekToLast() to return the last log index stored in RocksDB.

getEntry(index): calls RocksDB#get() with the handle defaultHandle and the specified log index, and returns the LogEntry stored at that index.

getTerm(index): calls RocksDB#get() with the handle defaultHandle and the specified log index to fetch the log at that index, and returns the term of that LogEntry.

appendEntry(entry): checks whether the LogEntry type is a configuration change. A configuration change is written with RocksDB#write() as a batch write; a log for a user-submitted task is stored with RocksDB#put() using the handle defaultHandle and the LogEntry object.

appendEntries(entries): calls RocksDB#write() to write the logs of Raft configuration changes or user-submitted tasks to RocksDB in batches with synchronous flushing. Merging IO write requests through Batch Write reduces method calls and context switches.

truncatePrefix(firstIndexKept): gets the first log index and starts a background thread that calls RocksDB#deleteRange() on the default handle defaultHandle and the configuration handle confHandle, deleting RocksDB log data from the head of the log, from the first log index up to the specified index.

truncateSuffix(lastIndexKept): gets the last log index and calls RocksDB#deleteRange() on the default handle defaultHandle and the configuration handle confHandle, cleaning up uncommitted logs at the tail of the log, from just after the specified index through the last index.

reset(nextLogIndex): gets the LogEntry at the nextLogIndex index, calls RocksDB#close() to close the RocksDB instance, calls RocksDB#destroyDB() to destroy the instance and clean up all RocksDB data, then reinitializes and loads the RocksDB instance and resets the next log index.
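The seekToFirst(), seekToLast(), and deleteRange() operations above depend on keys sorting in log-index order. The sketch below shows one way that can work, assuming the key is the log index encoded as 8 big-endian bytes and compared bytewise as unsigned values (the ordering of RocksDB's default comparator). IndexKeySketch and its helpers are hypothetical, and a TreeMap stands in for the database.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;
import java.util.TreeMap;

final class IndexKeySketch {
    // Unsigned lexicographic byte order, as a bytewise comparator would apply.
    static final Comparator<byte[]> BYTEWISE = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    // ByteBuffer is big-endian by default, so for non-negative indexes
    // the byte order of the keys matches the numeric order of the indexes.
    static byte[] key(long index) {
        return ByteBuffer.allocate(Long.BYTES).putLong(index).array();
    }

    static long index(byte[] key) {
        return ByteBuffer.wrap(key).getLong();
    }

    // Removes keys in the half-open range [begin, end), the same range
    // convention that RocksDB's deleteRange uses.
    static void deleteRange(TreeMap<byte[], String> db, long begin, long end) {
        db.subMap(key(begin), true, key(end), false).clear();
    }
}
```

With this encoding, truncating the prefix up to firstIndexKept corresponds to deleteRange(db, firstIndex, firstIndexKept), and truncating the suffix after lastIndexKept corresponds to deleteRange(db, lastIndexKept + 1, lastIndex + 1). A little-endian encoding would not work here, because index 256 would sort before index 1.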

The core RocksDB-based log storage implementation is in:

com.alipay.sofa.jraft.storage.impl.RocksDBLogStorage

LogManager storage call

The log manager LogManager is responsible for calling the underlying log storage LogStorage, and adds cache management, batch submission, checks, and optimizations around those calls. When a Raft group node (Node) is initialized or started, StorageFactory initializes the log storage and the log manager is built: LogManager is instantiated from LogManagerOptions, which carries the log storage LogStorage, the configuration manager ConfigurationManager, the finite state machine caller FSMCaller, the node metrics NodeMetrics, and so on. A Disruptor queue of stable-state callback (StableClosure) events is created according to the Raft node's Disruptor buffer size setting, and the handler StableClosureEventHandler is registered to process the queued events.

When StableClosureEventHandler is triggered, it checks whether the Log Entries attached to the StableClosure are empty. If they are not empty, the entries are accumulated for a later batch flush. If they are empty, the accumulated batch is written to RocksDB through the underlying LogStorage#appendEntries(entries), and the StableClosureEvent type is checked: for SHUTDOWN, RESET, TRUNCATE_PREFIX, TRUNCATE_SUFFIX, and LAST_LOG_ID events, the underlying LogStorage is called to carry out the corresponding callback (ResetClosure, TruncatePrefixClosure, TruncateSuffixClosure, LastLogIdClosure).
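The accumulate-then-flush behaviour of the event handler can be sketched as follows. This is a simplified illustration under stated assumptions, not the LogManager code: AppendBatcherSketch, its capacity rule, and the endOfBatch flag are hypothetical, strings stand in for log entries, and a list of batches stands in for LogStorage.

```java
import java.util.ArrayList;
import java.util.List;

final class AppendBatcherSketch {
    private final int capacity;
    private final List<String> buffer = new ArrayList<>();
    // Each element is one batched write; stands in for LogStorage#appendEntries.
    final List<List<String>> flushedBatches = new ArrayList<>();

    AppendBatcherSketch(int capacity) {
        this.capacity = capacity;
    }

    // Non-empty entries are buffered. An event without entries (a control
    // event such as a truncate or reset) forces the buffer out first, so the
    // control operation sees every log appended before it.
    void onEvent(List<String> entries, boolean endOfBatch) {
        if (entries != null && !entries.isEmpty()) {
            if (buffer.size() + entries.size() > capacity) {
                flush();
            }
            buffer.addAll(entries);
        } else {
            flush();
            // A real handler would dispatch on the event type here.
        }
        if (endOfBatch) {
            flush();
        }
    }

    // Writes the buffered entries as a single batch.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushedBatches.add(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Buffering several append events into one storage write trades a little latency for far fewer write calls, which is the point of the batch submission described above.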

When a client sends a command to SOFAJRaft, the LogManager of the Raft group node first stores the command locally as a log entry. The appendEntries(entries, done) method distinguishes two cases. If the node is currently the Leader and the entries come from the user with no log index assigned yet, indexes are assigned to the entries being appended. If the node is currently a Follower and the entries come from the Leader, the local log is checked against the entries and any conflict is resolved. The method then traverses the entries; for entries of the configuration-change type, the configuration manager ConfigurationManager caches the configuration change. The entries are added to logsInMemory as a cache, the StableClosure is given the entries to be stored, and an event of type OTHER is published to the StableClosure event queue, which triggers StableClosureEventHandler. The handler takes the entries from the callback, accumulates them in memory for a later unified batch flush, and writes them to the underlying LogStorage through appendToStorage(toAppend). At the same time, the Replicator copies the log to the other nodes, so replication proceeds concurrently. When the node has received a successful replication response from more than half of the nodes in the cluster, this log and the logs before it are applied to the state machine in order.

[Figure: logic of LogManager calling the log storage LogStorage]
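The "more than half" rule at the end of this flow can be sketched with the textbook Raft calculation below. It is an illustration only and not how SOFAJRaft tracks votes internally: QuorumSketch and matchIndex are hypothetical names, and the Raft requirement that the entry belong to the Leader's current term is omitted.

```java
import java.util.Arrays;

final class QuorumSketch {
    // matchIndex[i] is the highest log index known to be stored on node i,
    // with the Leader itself included as one of the nodes.
    static long committedIndex(long[] matchIndex) {
        long[] sorted = matchIndex.clone();
        Arrays.sort(sorted);
        // In ascending order, the element at (n - 1) / 2 is reached by at
        // least floor(n / 2) + 1 nodes, which is a majority.
        return sorted[(sorted.length - 1) / 2];
    }
}
```

For three nodes with match indexes 5, 3, and 4, two of the three nodes hold index 4, so logs up to index 4 can be applied to the state machine while index 5 still waits for another acknowledgement.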

Meta-information storage

Metadata storage, or meta-information storage, records the internal state of the Raft implementation, such as the current term (Term) and the PeerId of the node that this node voted for.

RaftMetaStorage storage implementation

RaftMetaStorage is the meta-information storage interface. It defines the core API of the metadata storage module for Raft metadata, including:

Set / get the current term (Term) in the Raft metadata

Set / query the PeerId of the node voted for in the Raft metadata.

The Raft internal state Term is a long number that increases monotonically across the whole Raft Group and identifies a round of voting. The Term in which a Leader is successfully elected is called the Leader Term, and all logs submitted while the Leader does not change carry the same Term number. PeerId represents a participant in the Raft protocol (Leader / Follower / Candidate, etc.) and consists of three elements, ip:port:index, where ip is the node's IP address, port is the port, and index is a sequence number for participants that share the same port. The RaftMetaStorage API is defined in:

com.alipay.sofa.jraft.storage.RaftMetaStorage

LocalRaftMetaStorage is implemented based on ProtoBuf

Protocol Buffers is a portable and efficient structured data format used for serializing and deserializing structured data. It suits data storage and RPC data exchange, and serves as a language-independent, platform-independent, extensible serialization format in communication protocols, data storage, and other fields. The user defines Protocol Buffer Message types in a .proto file to specify the data structure to be serialized. Each Message is a small logical unit of information containing a series of key-value pairs. Each Message type has one or more uniquely numbered fields, and each field consists of a name and a value type. A Message may define optional fields, required fields, and repeated fields.

The default RaftMetaStorage implementation, LocalRaftMetaStorage, stores Raft metadata locally as a ProtoBuf Message. When meta-information storage is initialized, StorageFactory creates a LocalRaftMetaStorage by default from the Raft meta-information storage path, the Raft internal options, and the node metrics. The main operations of the ProtoBuf-based LocalRaftMetaStorage include:

init(): gets the node Node from the Raft meta-information storage options RaftMetaStorageOptions, reads the ProtoBufFile named raft_meta and loads the StablePBMeta message, and caches the current Raft term Term and the voted-for PeerId from the StablePBMeta metadata.

shutdown(): builds a StablePBMeta message from the in-memory current term Term and voted-for PeerId, and writes it to the ProtoBufFile according to the Raft option that controls metadata synchronization.

setTerm(term): checks that LocalRaftMetaStorage has been initialized, caches the current term Term, and saves it to the ProtoBufFile as a ProtoBuf message according to the Raft option that controls metadata synchronization.

getTerm(): checks that LocalRaftMetaStorage has been initialized and returns the cached current term Term.

setVotedFor(peerId): checks that LocalRaftMetaStorage has been initialized, caches the voted-for PeerId, and saves it to the ProtoBufFile as a ProtoBuf message according to the Raft option that controls metadata synchronization.

getVotedFor(): checks that LocalRaftMetaStorage has been initialized and returns the cached voted-for PeerId.
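The operations above amount to a small cached record that is persisted on every change. The sketch below shows that shape under several assumptions: it uses plain java.io data streams rather than ProtoBuf, the sync flag is interpreted as "force the write to disk", the temporary file name raft_meta.tmp is invented, and FileMetaStorageSketch is a hypothetical class.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class FileMetaStorageSketch {
    private final Path file;
    private final boolean sync;   // assumption: true means fsync on every save
    private long term;            // cached current term
    private String votedFor = ""; // cached voted-for peer, empty if none

    FileMetaStorageSketch(Path dir, boolean sync) {
        this.file = dir.resolve("raft_meta");
        this.sync = sync;
        load();
    }

    // Loads the cached values from disk if a meta file already exists.
    private void load() {
        if (!Files.exists(file)) {
            return;
        }
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            term = in.readLong();
            votedFor = in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes to a temporary file, then moves it over the real file so that
    // a crash in the middle of a write cannot leave a half-written meta file.
    private void save() {
        Path tmp = file.resolveSibling("raft_meta.tmp");
        try {
            try (FileOutputStream fos = new FileOutputStream(tmp.toFile());
                 DataOutputStream out = new DataOutputStream(fos)) {
                out.writeLong(term);
                out.writeUTF(votedFor);
                out.flush();
                if (sync) {
                    fos.getFD().sync();
                }
            }
            Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    void setTerm(long newTerm) { term = newTerm; save(); }

    long getTerm() { return term; }

    void setVotedFor(String peerId) { votedFor = peerId; save(); }

    String getVotedFor() { return votedFor; }

    void shutdown() { save(); }
}
```

Persisting the term and the vote before acting on them matters for Raft safety: a node that forgot its vote after a restart could vote twice in the same term.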

The ProtoBuf-based local storage of Raft meta information is implemented in:

com.alipay.sofa.jraft.storage.impl.LocalRaftMetaStorage

Snapshot storage

When a Raft node (Node) restarts, the state machine data held in memory is lost, so the startup process has to replay all the logs stored in LogStorage to rebuild the entire state machine instance. This causes three problems:

If tasks are submitted frequently, as in a message middleware scenario, the rebuild takes a long time and startup is slow.

If there are many logs and the node has to keep all of them, the storage cost is not sustainable.

If a new Node is added, it has to fetch all the logs from the Leader and replay them into its state machine, which places a heavy burden on the Leader and on network bandwidth.

The Snapshot mechanism is introduced to solve these three problems. A snapshot is a record of the current value of the data: a "mirror" of the latest state of the state machine that is saved separately. Once the snapshot has been saved successfully, the logs before that point can be deleted, which reduces log storage. At startup, the node loads the latest snapshot directly and then replays only the logs that follow it. If the interval between snapshots is reasonable, replaying into the state machine is fast and startup is accelerated. Finally, a newly added node first copies the latest snapshot from the Leader and installs it into its local state machine, and then only needs to copy the subsequent logs, so it can quickly catch up with the whole Raft Group. The Leader generates snapshots for several purposes:

When a new node (Node) joins the cluster, it does not have to rely solely on log replication and replay to become consistent with the Leader. Installing the Leader's snapshot skips replaying a large number of logs.

The Leader sends a snapshot instead of replicating logs, which reduces the amount of data on the network.

Snapshots replace earlier logs, which saves storage space.
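The effect of a snapshot on restart can be shown with a toy state machine whose state is a running sum. This is an illustration of the idea only: SnapshotReplaySketch and its fields are hypothetical, and each log entry is just a number to add.

```java
import java.util.TreeMap;

final class SnapshotReplaySketch {
    final TreeMap<Long, Integer> log = new TreeMap<>(); // log index -> delta
    long snapshotIndex = 0; // last log index covered by the snapshot
    int snapshotValue = 0;  // state machine value as of snapshotIndex

    void append(long index, int delta) {
        log.put(index, delta);
    }

    // Saves the state as of the given index, then deletes the logs the
    // snapshot already covers.
    void snapshot(long index) {
        snapshotValue = replay(index);
        snapshotIndex = index;
        log.headMap(index, true).clear();
    }

    // Rebuilds the state: start from the snapshot and replay only the
    // logs after snapshotIndex, up to and including upTo.
    int replay(long upTo) {
        int value = snapshotValue;
        for (int delta : log.subMap(snapshotIndex, false, upTo, true).values()) {
            value += delta;
        }
        return value;
    }
}
```

After snapshot(3) on a five-entry log, only entries 4 and 5 remain to be stored and replayed, yet replay(5) yields the same state as replaying all five entries from scratch.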

Snapshot storage stores the user's state machine snapshot and its meta information:

SnapshotStorage is the snapshot storage implementation.

SnapshotExecutor manages snapshot storage, remote installation, and replication.

SnapshotStorage storage implementation

SnapshotStorage is the snapshot storage interface. It defines the core API of the snapshot storage module for the Raft state machine, including:

Set filterBeforeCopyRemote; when it is true, data is filtered before being copied from the remote

Create a snapshot writer

Open a snapshot reader

Copy data from a remote Uri

Start a copy job that copies data from a remote Uri

Configure SnapshotThrottle, which limits throughput in heavy disk read / write scenarios, such as disk reads and writes and network bandwidth.
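The throttling idea in the last item can be sketched as a fixed-window byte budget. This is a generic rate limiter written for illustration, not the library's SnapshotThrottle: ThroughputThrottleSketch, its window rule, and the injectable clock are all assumptions.

```java
import java.util.function.LongSupplier;

final class ThroughputThrottleSketch {
    private final long bytesPerWindow;
    private final long windowMillis;
    private final LongSupplier clock; // injectable so tests can control time
    private long windowStart;
    private long used;

    ThroughputThrottleSketch(long bytesPerWindow, long windowMillis, LongSupplier clock) {
        this.bytesPerWindow = bytesPerWindow;
        this.windowMillis = windowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    // Returns how many of the requested bytes may be transferred now.
    // The caller sends that many and retries the remainder later.
    synchronized long acquire(long bytes) {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMillis) {
            windowStart = now; // a new window starts with a full budget
            used = 0;
        }
        long granted = Math.min(bytes, bytesPerWindow - used);
        used += granted;
        return granted;
    }
}
```

A snapshot copier would call acquire before each read or send and back off when it is granted fewer bytes than it asked for, which keeps snapshot transfer from starving normal log replication. In production code the clock would typically be System::currentTimeMillis.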

LocalSnapshotStorage is implemented based on local files

By default, SnapshotStorage is implemented by LocalSnapshotStorage, which stores Raft state machine snapshots as local files. When snapshot storage is initialized, StorageFactory creates a LocalSnapshotStorage by default from the Raft snapshot storage path and the Raft options. The main methods of the local-file-based LocalSnapshotStorage include:

init(): deletes the temporary snapshot named temp, destroys old snapshots whose file names start with snapshot_, and obtains the last snapshot index lastSnapshotIndex.

close(): renames the temporary snapshot file according to the last snapshot index lastSnapshotIndex and the snapshot index of the writer LocalSnapshotWriter, and destroys the snapshot at the writer's storage path.

create(): destroys the temporary snapshot named temp, creates and initializes the snapshot writer LocalSnapshotWriter on the temporary snapshot storage path, and loads the Raft snapshot metadata named __raft_snapshot_meta into memory.

open(): derives the snapshot storage path with the snapshot_ prefix from the last snapshot index lastSnapshotIndex, creates and initializes the snapshot reader LocalSnapshotReader on that path, and loads the Raft snapshot metadata named __raft_snapshot_meta into memory.

startToCopyFrom(uri, opts): creates and initializes the state machine snapshot copier LocalSnapshotCopier, creates the remote file copier RemoteFileCopier, connects the Raft client RPC service to the specified Uri based on the remote service address Endpoint, and starts a background thread to copy the snapshot data. The copier loads the Raft snapshot metadata to obtain the remote snapshot files, reads the data under the specified remote snapshot storage path and copies it to the BoltSession, and LocalSnapshotCopier synchronizes the Raft snapshot metadata.
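The temp-then-rename layout behind init(), create(), close(), and open() can be sketched with plain java.nio.file calls. The directory names temp and snapshot_ come from the text above; everything else, including the class LocalSnapshotDirSketch, the single data file, and the policy of keeping only the newest snapshot, is an assumption for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

final class LocalSnapshotDirSketch {
    static final String TEMP = "temp";
    static final String PREFIX = "snapshot_";
    private final Path root;
    long lastSnapshotIndex;

    LocalSnapshotDirSketch(Path root) {
        this.root = root;
        init();
    }

    // Removes a leftover temp directory and all but the newest snapshot.
    void init() {
        try {
            deleteTree(root.resolve(TEMP));
            long newest = 0;
            try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root, PREFIX + "*")) {
                for (Path p : dirs) {
                    newest = Math.max(newest, indexOf(p));
                }
            }
            try (DirectoryStream<Path> dirs = Files.newDirectoryStream(root, PREFIX + "*")) {
                for (Path p : dirs) {
                    if (indexOf(p) != newest) {
                        deleteTree(p);
                    }
                }
            }
            lastSnapshotIndex = newest;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes into temp, renames it to snapshot_<index>, then drops the
    // previous snapshot. Readers never see a half-written snapshot.
    void save(long index, String state) {
        try {
            Path temp = root.resolve(TEMP);
            deleteTree(temp);
            Files.createDirectories(temp);
            Files.write(temp.resolve("data"), state.getBytes(StandardCharsets.UTF_8));
            Files.move(temp, root.resolve(PREFIX + index));
            if (lastSnapshotIndex != 0 && lastSnapshotIndex != index) {
                deleteTree(root.resolve(PREFIX + lastSnapshotIndex));
            }
            lastSnapshotIndex = index;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the state stored in the newest snapshot directory.
    String open() {
        try {
            Path data = root.resolve(PREFIX + lastSnapshotIndex).resolve("data");
            return new String(Files.readAllBytes(data), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static long indexOf(Path dir) {
        return Long.parseLong(dir.getFileName().toString().substring(PREFIX.length()));
    }

    // Deletes a directory tree, children before parents.
    private static void deleteTree(Path p) throws IOException {
        if (!Files.exists(p)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(p)) {
            walk.sorted(Comparator.reverseOrder()).forEach(q -> q.toFile().delete());
        }
    }
}
```

Because the rename is the last step of a save, a crash before it leaves only a temp directory, which the next init() discards.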

SnapshotExecutor storage management

The snapshot executor SnapshotExecutor is responsible for storing Raft state machine snapshots, installing snapshots from a remote Leader, and copying snapshot files. It has two core operations: the state machine snapshot doSnapshot(done) and the snapshot installation installSnapshot(request, response, done).

doSnapshot(done) obtains a snapshot writer LocalSnapshotWriter from the snapshot storage based on the temporary temp path and loads the __raft_snapshot_meta snapshot metadata file to initialize the writer. It then builds the save callback SaveSnapshotDone and hands it to FSMCaller, which publishes a SNAPSHOT_SAVE task event to the Disruptor queue. Through the Ring Buffer, this triggers the apply task handler ApplyTaskHandler to run the snapshot save task, which calls the onSnapshotSave() method so that each kind of state machine stores its own snapshot.

installSnapshot(request, response, done) creates and registers a snapshot download job DownloadingSnapshot from the install request, the response, and the snapshot meta information, and loads the downloaded snapshot to obtain the SnapshotReader of the current snapshot copier. It then builds the install callback InstallSnapshotDone and hands it to FSMCaller, which publishes a SNAPSHOT_LOAD task event to the Disruptor queue. This likewise triggers ApplyTaskHandler through the Ring Buffer to run the snapshot install task, which calls onSnapshotLoad() so that each kind of state machine loads the snapshot.

[Figure: logic of SnapshotExecutor state machine snapshot and remote snapshot installation]
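The two flows share one pattern: the executor publishes a task to a queue, and a single consumer calls back into the state machine. The sketch below mimics that pattern with a single-threaded executor in place of the Disruptor. SketchStateMachine and SnapshotExecutorSketch are hypothetical, the callbacks are simplified to pass a string instead of a writer or reader, and none of this is the SOFAJRaft API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical callbacks; the names echo onSnapshotSave / onSnapshotLoad above.
interface SketchStateMachine {
    String onSnapshotSave();                // returns the serialized state
    boolean onSnapshotLoad(String snapshot); // replaces the state, true on success
}

final class SnapshotExecutorSketch implements AutoCloseable {
    // One consumer thread keeps snapshot tasks ordered with each other,
    // standing in for the single event handler on the queue.
    private final ExecutorService queue = Executors.newSingleThreadExecutor();
    private final SketchStateMachine fsm;
    private volatile String stored; // stands in for snapshot storage

    SnapshotExecutorSketch(SketchStateMachine fsm) {
        this.fsm = fsm;
    }

    // Publishes a save task; the future completes when the snapshot is stored.
    CompletableFuture<Boolean> doSnapshot() {
        return CompletableFuture.supplyAsync(() -> {
            stored = fsm.onSnapshotSave();
            return true;
        }, queue);
    }

    // Publishes a load task for a snapshot received from a remote node.
    CompletableFuture<Boolean> installSnapshot(String remoteSnapshot) {
        return CompletableFuture.supplyAsync(() -> {
            boolean ok = fsm.onSnapshotLoad(remoteSnapshot);
            if (ok) {
                stored = remoteSnapshot;
            }
            return ok;
        }, queue);
    }

    String stored() {
        return stored;
    }

    @Override
    public void close() {
        queue.shutdown();
    }
}
```

Routing both save and load through the same single-consumer queue means the state machine never has to handle a snapshot callback concurrently with another one.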
