This article explains how the Raft consensus algorithm is applied in the distributed storage system Curve: first a brief introduction to Raft itself, then how Curve uses it for replica synchronization, system scheduling, and snapshots.
Introduction to the Raft consensus algorithm
In the Raft algorithm there are three roles: Leader, Follower, and Candidate. The transitions between them are shown below:
There can be only one Leader at a time. The period from one Leader being elected until the next election begins is called a term. The Leader maintains its authority over the Followers through heartbeats and uses them to synchronize data to the Followers. A Follower initiates an election only if it does not receive the Leader's heartbeat within the election timeout.
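To make the transitions concrete, here is a minimal C++ sketch (illustrative only, not Curve or braft code) of the three roles and the events that move a node between them:

enum class Role { Follower, Candidate, Leader };

struct RaftNode {
    Role role = Role::Follower;

    void on_election_timeout() {               // no heartbeat within the timeout
        if (role != Role::Leader) role = Role::Candidate;  // start an election
    }
    void on_won_election()     { role = Role::Leader; }    // majority of votes
    void on_higher_term_seen() { role = Role::Follower; }  // step down
};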
Leader election: Raft is a majority protocol, and the condition for winning an election is obtaining votes from a majority of the nodes, including the candidate itself. If a Follower does not receive the Leader's heartbeat within the timeout, it becomes a Candidate and starts a new round of election. Each node may vote only once per term, on a first-come-first-served basis. To avoid multiple Followers timing out at the same moment, the election timeout in Raft is a fixed time plus a random time.
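A minimal sketch of that randomized timeout; the base and jitter values here are assumptions for illustration, not Curve's actual configuration:

#include <chrono>
#include <random>

// Election timeout = fixed base + random jitter, so that Followers rarely
// time out at the same instant.
std::chrono::milliseconds election_timeout() {
    static thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<int> jitter_ms(0, 150);  // random part
    return std::chrono::milliseconds(1000)                 // fixed part
         + std::chrono::milliseconds(jitter_ms(gen));
}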
Log replication: during its term, the Leader receives a request from the client, encapsulates it as a log entry (Raft Log Entry), appends it to its own Raft log, and then issues AppendEntries RPCs to the other servers in parallel. Once it determines that the entry has been successfully replicated on a majority of nodes (a step called commit), it can execute the command (a step called apply) and return the result to the client. A log entry consists of three parts: the log index, the term to which the entry belongs, and the command to be executed.
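The structure of a log entry can be sketched like this (field names are illustrative, not braft's actual types):

#include <cstdint>
#include <string>

// The three parts of a Raft log entry described above.
struct LogEntry {
    uint64_t    index;    // position of the entry in the log
    uint64_t    term;     // term in which the entry was created
    std::string command;  // serialized command to execute at apply time
};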
Configuration changes: in Raft, the set of members of a replication group is called the configuration. The configuration is not fixed: nodes are added or removed as requirements change, and faulty nodes have to be replaced. Switching directly from one configuration to another is not safe, because different servers switch at different points in time. Therefore, when a Raft configuration changes, a special log entry C_old,new is created first. After this entry is committed, the group enters the joint-consensus phase, in which the old and the new configuration make decisions together. A C_new log entry is then generated, and after that entry is committed, decisions can be made by the new configuration alone.
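The joint-consensus rule can be sketched as follows: while C_old,new is in effect, a decision must reach a majority in both the old and the new configuration (types and names are illustrative):

#include <cstddef>
#include <set>
#include <string>

using Config = std::set<std::string>;  // node ids in one configuration

bool majority(const Config& conf, const std::set<std::string>& acks) {
    std::size_t n = 0;
    for (const auto& id : conf) n += acks.count(id);  // count acks from members
    return n * 2 > conf.size();
}

bool joint_decision(const Config& c_old, const Config& c_new,
                    const std::set<std::string>& acks) {
    // Needs a majority in BOTH configurations during the joint phase.
    return majority(c_old, acks) && majority(c_new, acks);
}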
Install snapshot: a Raft snapshot is the system state saved at some point in time. Snapshots serve two purposes. One is log compaction: after a snapshot is taken, the log entries before that point can be deleted. The other is startup acceleration: when the system starts, it does not need to replay all the logs. If, while synchronizing the log to a Follower, the Leader finds that the required log entries have already been deleted by a snapshot, it can synchronize by sending an InstallSnapshot RPC to the Follower.
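The Leader-side decision can be sketched like this; the two RPC senders are hypothetical stubs for illustration:

#include <cstdint>
#include <iostream>

void send_install_snapshot_rpc() { std::cout << "InstallSnapshot\n"; }
void send_append_entries_rpc(uint64_t next) {
    std::cout << "AppendEntries from index " << next << "\n";
}

void replicate(uint64_t follower_next_index, uint64_t snapshot_last_index) {
    if (follower_next_index <= snapshot_last_index) {
        send_install_snapshot_rpc();  // needed entries were compacted away
    } else {
        send_append_entries_rpc(follower_next_index);
    }
}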
Application of the Raft algorithm in Curve
Curve is a distributed storage system. In Curve, the smallest unit of data sharding is called a Chunk, and the default Chunk size is 16MB. At large storage capacity this produces a huge number of Chunks, which puts considerable pressure on metadata storage and management. Therefore the concept of the CopySet is introduced. A CopySet can be understood as a group of ChunkServers; one CopySet manages multiple Chunks, and synchronization and management between replicas are organized in CopySet units. The relationship between ChunkServer, CopySet and Chunk is shown in the following figure:
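As a rough illustration of this hierarchy (sketch types, not Curve's actual definitions):

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Chunk { uint64_t id; /* 16MB of data on disk */ };

struct CopySet {
    uint32_t id;
    std::vector<std::string> members;  // ChunkServer addresses (3 replicas)
    std::map<uint64_t, Chunk> chunks;  // many Chunks share one raft group
};

struct ChunkServer {
    std::map<uint32_t, CopySet> copysets;  // multi-raft: many groups per server
};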
Curve's copyset uses braft as its consensus-protocol component, in multi-raft form: the same machine can belong to multiple replication groups, or, put the other way around, there can be multiple raft instances on one machine. On top of braft we implemented data synchronization between replicas, system scheduling, lightweight raft snapshots, and other functions, which are described in detail below.
Data synchronization between replicas
On the ChunkServer, CopysetNode encapsulates the braft node, as shown in the following figure:
When a Curve client sends a write request, the steps in the three-replica case are as follows (a sketch of the apply step follows the list):
1. The Client sends a write request to the Leader ChunkServer.
2. The ChunkServer receives the request, encapsulates it into a log entry, and submits it to raft.
3. The braft module sends the entry to the other replicas (ChunkServers) while persisting it locally.
4. Once the entry is persisted locally and on at least one other replica (a majority of the three), it is committed.
5. After commit, the entry is applied; the apply step performs the actual write to disk.
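A minimal sketch of the apply step (step 5), assuming braft's StateMachine interface; write_chunk is a hypothetical helper, not Curve's actual CopysetNode code:

#include <braft/raft.h>
#include <butil/iobuf.h>

class CopysetStateMachine : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override {
        for (; iter.valid(); iter.next()) {
            write_chunk(iter.data());             // the actual write to disk
            if (iter.done()) iter.done()->Run();  // answer the client (leader side)
        }
    }
private:
    void write_chunk(const butil::IOBuf& data) { /* write into the chunk file */ }
};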
In this process, the user's data and operations travel between replicas as braft log entries, which keeps the data synchronized. In the three-replica scenario, the time to return to the upper layer depends on the two faster replicas, so the impact of a slow disk is reduced. For the slower replica, the leader keeps retrying until the data is synchronized, so as long as the system is working properly, the data on all three replicas is eventually consistent.
System scheduling based on Raft
In Curve, ChunkServers regularly report heartbeats to the metadata node MDS. Besides statistics about the ChunkServer itself, a heartbeat also contains statistics about the CopySets on that ChunkServer, including each CopySet's leader, its replication-group members, whether a configuration change is in progress, the configuration epoch, and so on. Based on these statistics, MDS generates a series of Raft configuration-change requests and sends them to the ChunkServer where the CopySet's leader is located.
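The per-CopySet statistics carried in a heartbeat might look like this (field names are assumptions, not Curve's actual protobuf definitions):

#include <cstdint>
#include <string>
#include <vector>

struct CopySetInfo {
    uint32_t                 copyset_id;
    std::string              leader;       // current leader address
    std::vector<std::string> peers;        // replication-group members
    uint64_t                 epoch;        // configuration epoch
    bool                     config_change_in_progress;
};

struct HeartbeatRequest {
    std::string              chunkserver_id;
    uint64_t                 used_bytes;   // capacity statistics
    std::vector<CopySetInfo> copysets;
};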
Issuing configuration changes
Curve ChunkServers regularly report heartbeats to MDS, and the configuration changes dispatched by MDS are carried in the heartbeat response. The process of reporting a heartbeat is shown below:
The heartbeat is triggered by a timer task. Besides statistics such as its own capacity information, the ChunkServer also reports information about each Copyset, such as its leader, members, epoch, and whether a configuration change is in progress.
The process by which MDS sends configuration changes in the heartbeat response is as follows:
After the ChunkServer receives the response, it parses the configuration-change information and dispatches it to the corresponding Copysets.
Curve epoch
Synchronizing configuration with the epoch. The scheduling information generated by MDS is produced by a background timer task and is only sent in the response to the next request from the ChunkServer, so a configuration change issued by MDS may already be stale. To keep the configuration synchronized between MDS and ChunkServer, Curve introduces the epoch mechanism. The epoch starts at 0 and is incremented on every configuration change (including leader changes). A configuration change about to be issued is considered valid only if the epoch in MDS equals the epoch on the ChunkServer side. What is the difference between epoch and term? The term denotes the Leader's period in office and is related only to elections, whereas the epoch changes with configuration changes, which include the case of a Leader election.
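The staleness check on the MDS side reduces to a simple comparison (names are illustrative):

#include <cstdint>

// Any leader change or configuration change bumps the epoch, so a mismatch
// means the scheduled change was computed from a stale view of the group.
bool config_change_is_valid(uint64_t epoch_in_mds,
                            uint64_t epoch_reported_by_chunkserver) {
    return epoch_in_mds == epoch_reported_by_chunkserver;
}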
Updating the epoch. The epoch is updated on the ChunkServer side. braft provides a user state machine in its implementation: the corresponding callbacks of the user state machine are invoked when events occur inside braft, such as apply, error, shutdown, snapshot save, snapshot load, leader change, and configuration change. Curve's copyset interacts with braft by inheriting this user state machine, and the epoch is incremented in the on_configuration_committed callback. In braft, the current configuration is committed again when the Leader changes, so incrementing the epoch in on_configuration_committed ensures that the epoch increases monotonically on both configuration changes and Leader changes.
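A simplified sketch of this, assuming braft's StateMachine interface (not Curve's actual copyset code):

#include <atomic>
#include <cstdint>
#include <braft/raft.h>

class EpochStateMachine : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override {
        for (; iter.valid(); iter.next()) { /* apply user writes */ }
    }
    void on_configuration_committed(const braft::Configuration& /*conf*/) override {
        // braft re-commits the current configuration on a leader change, so
        // this fires for both membership changes and new leaders.
        epoch_.fetch_add(1);
    }
private:
    std::atomic<uint64_t> epoch_{0};
};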
Persisting the epoch. On the MDS side, the epoch is persisted in etcd together with the other CopySet information. The ChunkServer also persists the epoch, but it does not need to do so every time the epoch changes; instead it takes advantage of raft's log replay and snapshot capabilities. Consider the following two situations:
If raft never takes a snapshot, the epoch does not need to be persisted at all, because all operation logs, including the configuration-change entries, have already been persisted. When the service restarts, on_configuration_committed is called again while the logs are replayed, and the epoch eventually returns to its pre-restart value.
When raft does take snapshots, the entries before the snapshot are deleted, so the epoch cannot be recovered by replay alone and must be persisted. But it only needs to be persisted with its current value at the moment a raft snapshot is taken. On reboot, the raft snapshot is installed first, restoring the epoch to the value recorded in the snapshot, and replaying the subsequent log entries then brings the epoch back to its pre-restart value. Saving the epoch during a snapshot is done in the on_snapshot_save function, as sketched below.
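A simplified sketch of this scheme, assuming braft's snapshot callbacks; the single-file format is an assumption, not Curve's actual on-disk layout:

#include <braft/raft.h>
#include <brpc/closure_guard.h>
#include <cstdint>
#include <fstream>

class EpochSnapshotFsm : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override {
        for (; iter.valid(); iter.next()) { /* apply user writes */ }
    }
    void on_snapshot_save(braft::SnapshotWriter* writer,
                          braft::Closure* done) override {
        brpc::ClosureGuard done_guard(done);   // run done when we return
        std::ofstream out(writer->get_path() + "/epoch");
        out << epoch_;                         // current value at snapshot time
        writer->add_file("epoch");             // register it in the snapshot meta
    }
    int on_snapshot_load(braft::SnapshotReader* reader) override {
        std::ifstream in(reader->get_path() + "/epoch");
        if (in) in >> epoch_;  // replaying the later entries catches up the rest
        return 0;
    }
private:
    uint64_t epoch_ = 0;
};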
Lightweight Raft snapshots
As mentioned in the Raft introduction above, Raft needs to take snapshots regularly to clean up old log entries; otherwise the Raft log would grow without bound. Taking a snapshot requires saving the current state of the system. For Curve's block-storage scenario, the system state is the current data of the Chunks. The intuitive approach is to copy all the chunks as a backup when taking a snapshot, but this has two problems:
It requires twice the storage space, which is a serious waste.
Curve takes a snapshot every 30 minutes by default, so this scheme would copy data frequently, putting great pressure on the disks and affecting normal IO.
Therefore, the Raft snapshot used in Curve is lightweight: when a snapshot is taken, only the list of current Chunk files is saved, and the Chunk data itself is not backed up. The specific process is as follows:
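A simplified sketch of the idea, assuming braft's SnapshotWriter interface; list_chunk_files is a hypothetical helper, and Curve's real implementation handles the details of exposing these files to Followers:

#include <braft/raft.h>
#include <brpc/closure_guard.h>
#include <string>
#include <vector>

class LightweightSnapshotFsm : public braft::StateMachine {
public:
    void on_apply(braft::Iterator& iter) override {
        for (; iter.valid(); iter.next()) { /* apply user writes */ }
    }
    void on_snapshot_save(braft::SnapshotWriter* writer,
                          braft::Closure* done) override {
        brpc::ClosureGuard done_guard(done);
        for (const std::string& name : list_chunk_files()) {
            writer->add_file(name);  // record the file name only, no data copy
        }
    }
private:
    std::vector<std::string> list_chunk_files() {
        return {};  // enumerate the Chunk files currently on disk
    }
};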
As a result, the chunk files downloaded by a Follower reflect not the state at the snapshot point but the latest state, and newer data is written again when the log is replayed. This is acceptable in our scenario because the underlying overwrite is idempotent: writing once produces the same result as writing multiple times.