This article introduces DLedger, a commitlog repository based on the raft protocol, in a concise and easy-to-understand way. I hope you will gain something from the detailed introduction below.
I. The Purpose of Introducing DLedger
Prior to RocketMQ 4.5, RocketMQ could only be deployed in Master/Slave mode. A broker group has one Master and zero or more Slaves, and the Slaves synchronize data from the Master through synchronous or asynchronous replication. The Master/Slave deployment mode provides a degree of high availability.
However, this deployment mode has some shortcomings. In terms of failover, for example, if the master node goes down, a restart or a manual switch is required; a slave node cannot be promoted to master automatically. We therefore want a new multi-replica architecture that solves this problem.
The new multi-replica architecture first needs to solve the problem of automatic failover, which is essentially a problem of automatic leader election. There are basically two solutions:
Use a third-party coordination service cluster, such as ZooKeeper or etcd, to elect the master. This solution introduces a heavyweight external component and increases the cost of deployment, operation and troubleshooting: besides maintaining the RocketMQ cluster, the ZooKeeper cluster must also be maintained, and a ZooKeeper cluster failure would affect the RocketMQ cluster.
Use the raft protocol to perform automatic leader election. Compared with the former, the advantage of raft is that no external component is needed: the election logic is integrated into each node's process, and the nodes elect a leader by communicating with one another.
We therefore chose the raft protocol to solve this problem. DLedger is a commitlog repository based on the raft protocol, and it is the key to RocketMQ's new highly available multi-replica architecture.
II. DLedger Design Concept
1. DLedger Positioning
DLedger is a lightweight Java library. Its external API is very simple: append and get. Append writes data into DLedger, and each written entry corresponds to an increasing index; get retrieves the corresponding data by that index. DLedger is therefore an append-only log system.
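To make the append/get contract concrete, here is a minimal sketch of an append-only log in Java. The class and method names are illustrative only and are not DLedger's actual API; it simply models the behavior described above: append returns an ever-increasing index, and get looks the data up by that index.

import java.util.ArrayList;
import java.util.List;

// Illustrative append-only log: not the real DLedger API.
public class AppendOnlyLog {
    private final List<byte[]> entries = new ArrayList<>();

    // Append a record and return its monotonically increasing index.
    public synchronized long append(byte[] data) {
        entries.add(data);
        return entries.size() - 1;
    }

    // Retrieve the record previously written at the given index.
    public synchronized byte[] get(long index) {
        return entries.get((int) index);
    }
}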
2. DLedger Application Scenarios
The first application of DLedger is in distributed messaging systems. Since RocketMQ 4.5, RocketMQ can be deployed in RocketMQ on DLedger mode: the DLedger commitlog replaces the original commitlog, giving the commitlog the ability to elect a leader and replicate. Through role transmission, the raft role is then mapped onto the external broker role, with leader corresponding to the original master, and follower and candidate corresponding to the original slave.
RocketMQ's brokers thus gain the ability to fail over automatically. Within a broker group, after the Master goes down, DLedger's automatic election chooses a new leader, which role transmission then turns into the new Master.
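As a rough illustration of role transmission, the sketch below (hypothetical types, not RocketMQ's actual classes) maps the raft role elected by DLedger onto the broker role the rest of RocketMQ understands: only the leader serves as master, while followers and candidates serve as slaves.

// Hypothetical sketch of the role-transmission mapping.
enum RaftRole { LEADER, FOLLOWER, CANDIDATE }
enum BrokerRole { MASTER, SLAVE }

final class RoleTransmitter {
    static BrokerRole toBrokerRole(RaftRole raftRole) {
        // Only the raft leader acts as master; all other roles serve as slave.
        return raftRole == RaftRole.LEADER ? BrokerRole.MASTER : BrokerRole.SLAVE;
    }
}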
DLedger can also be used to build highly available embedded KV storage. We record operations on the data in DLedger and then, depending on the data volume and actual needs, replay them into a hashmap or RocksDB to build a consistent, highly available KV storage system, which can be applied to scenarios such as meta-information management.
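The sketch below shows one way such a KV store could be rebuilt, assuming a simple "SET key value" record format and the illustrative AppendOnlyLog from the earlier sketch; it is not DLedger's actual API, just the replay idea: apply every log entry in index order to an in-memory map.

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative KV store rebuilt by replaying an append-only log.
public class ReplayedKvStore {
    private final Map<String, String> state = new HashMap<>();

    // Replay every entry up to lastIndex, in index order, to rebuild the map.
    public void replay(AppendOnlyLog log, long lastIndex) {
        for (long i = 0; i <= lastIndex; i++) {
            String[] op = new String(log.get(i), StandardCharsets.UTF_8).split(" ", 3);
            if ("SET".equals(op[0])) {
                state.put(op[1], op[2]);
            }
        }
    }

    public String get(String key) {
        return state.get(key);
    }
}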
III. DLedger Optimizations
1. Performance Optimization
The replication process in the raft protocol can be divided into four steps: the client sends a message to the leader; the leader stores it locally and also copies it to the followers; the leader waits for follower acknowledgements; once a majority of nodes have acknowledged the message, it can be committed and a success response is returned to the client. How does DLedger optimize this replication process?
(1) Asynchronous threading model
DLedger uses an asynchronous threading model, which reduces waiting. With fewer blocking points in the system, each thread waits less while processing requests, so the CPU is used more fully and throughput and performance improve.
The DLedger asynchronous threading model is described below in terms of the whole process of handling an Append request. In the figure, thick arrows represent RPC requests, solid arrows represent data flow, and dotted lines represent control flow.
First, the client sends an Append request, which is handled by DLedger's communication module. DLedger's default communication module is implemented with Netty, so a Netty IO thread hands the request over to a thread in the business thread pool and then immediately returns to process the next request. The business thread handles the Append request in three steps. First, it writes the Append data into its own log, that is, into the pagecache. Second, it creates an Append CompletableFuture and puts it into a Pending Map; the request is still in a pending state because the log has not yet been acknowledged by a majority. Third, it wakes up the EntryDispatcher thread and tells it to replicate the log to the followers. After these three steps the business thread can move on to the next Append request, with almost no waiting in between.
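The pending-map idea in step two can be sketched as follows. The names are hypothetical and the sketch ignores error handling; it only shows how a business thread can register a CompletableFuture keyed by log index and return immediately, leaving a later quorum check to complete the future.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of the Pending Map used by the append path.
public class PendingAppends {
    private final ConcurrentMap<Long, CompletableFuture<Void>> pending = new ConcurrentHashMap<>();

    // Called by the business thread right after writing the entry at `index` to the local log.
    public CompletableFuture<Void> register(long index) {
        CompletableFuture<Void> future = new CompletableFuture<>();
        pending.put(index, future);
        return future; // handed to the communication layer, which responds when it completes
    }

    // Called once the quorum checker decides everything up to quorumIndex is committed.
    public void completeUpTo(long quorumIndex) {
        pending.forEach((index, future) -> {
            if (index <= quorumIndex && pending.remove(index, future)) {
                future.complete(null); // triggers the response back to the client
            }
        });
    }
}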
On the other side, the replication thread EntryDispatcher replicates the log to the followers, with one EntryDispatcher thread per follower. Each thread records its own replication position for its follower and notifies the QuorumAckChecker thread every time the position advances. Based on the replication positions, that thread decides whether a log entry has been replicated to a majority of nodes; if so, the entry can be committed, the corresponding Append CompletableFuture is completed, and the communication module is notified to return a response to the client.
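The quorum decision itself amounts to finding the highest index that is already stored on a majority of nodes. The following sketch (illustrative names, not DLedger's implementation) computes that index from the replication positions reported for each node, the leader included.

import java.util.Arrays;
import java.util.Map;

// Hypothetical sketch of the majority check performed by a quorum-ack checker.
public final class QuorumCalculator {
    // positions: node id -> highest log index stored on that node (leader included).
    static long quorumIndex(Map<String, Long> positions) {
        long[] sorted = positions.values().stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(sorted);
        int majority = positions.size() / 2 + 1;
        // After sorting ascending, the value at position size - majority is already
        // reached by at least `majority` nodes.
        return sorted[positions.size() - majority];
    }
}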
(2) Independent and concurrent replication process
In DLedger, the leader sends logs to all followers independently and concurrently. The leader allocates one thread per follower to replicate the log and record the corresponding replication position, and a separate asynchronous thread checks, based on those positions, whether a log entry has been replicated to a majority of nodes and returns the result to the client.
(3) Log parallel replication
In traditional linear replication, the leader replicates a log entry to a follower and only replicates the next entry after the follower has acknowledged it; that is, the leader waits for confirmation of the previous entry before sending the next one. This guarantees order and correctness, but throughput is very low and latency is relatively high, so DLedger designs and implements log parallel replication: the leader no longer waits for the previous entry to finish replicating before sending the next one. The follower only needs to maintain a list of requests sorted by log index, and the follower thread processes these replication requests sequentially in index order. Parallel replication may leave gaps in the data, which can be fixed with a small amount of retransmission.
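A follower-side sketch of this idea is shown below (hypothetical names, no networking or retransmission logic): pushes may arrive out of order, so they are parked in an index-sorted map and applied only when the next expected index is present, with gaps simply waiting for the leader to resend.

import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of how a follower can apply parallel replication pushes in index order.
public class FollowerReplicationBuffer {
    private final ConcurrentSkipListMap<Long, byte[]> pendingWrites = new ConcurrentSkipListMap<>();
    private long nextExpectedIndex = 0;

    // Called for every push from the leader, regardless of arrival order.
    public synchronized void onPush(long index, byte[] data) {
        pendingWrites.put(index, data);
        drainInOrder();
    }

    // Apply entries strictly in index order; a gap means we wait for retransmission.
    private void drainInOrder() {
        while (!pendingWrites.isEmpty() && pendingWrites.firstKey() == nextExpectedIndex) {
            byte[] data = pendingWrites.pollFirstEntry().getValue();
            appendToLocalLog(nextExpectedIndex, data);
            nextExpectedIndex++;
        }
    }

    private void appendToLocalLog(long index, byte[] data) {
        // Placeholder for the actual pagecache/disk write.
    }
}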
2. Reliability Optimization
(1) DLedger's optimization for network partitions
If the network partition shown in the figure above occurs, N2 is isolated from the other nodes in the cluster. In an implementation that follows the raft paper, N2 keeps requesting votes but never obtains a majority, so its term keeps growing. Once the network recovers, N2 interrupts N1 and N3, which are replicating normally, and triggers a re-election. To handle this situation, DLedger's implementation improves the raft protocol by splitting the vote request process into several stages, two of which are important: WAIT_TO_REVOTE and WAIT_TO_VOTE_NEXT. WAIT_TO_REVOTE is the initial state; in this state the term is not incremented when requesting votes, whereas WAIT_TO_VOTE_NEXT increments the term before the next round of voting begins. In N2's situation in the figure, when the number of valid votes does not reach a majority, the node state can be set to WAIT_TO_REVOTE so that the term does not increase. This approach improves DLedger's tolerance of network partitions.
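The sketch below illustrates the intent of the two states (simplified, with hypothetical names; the real DLedger state machine has more stages and conditions): a round that cannot even gather a majority of valid votes retries in WAIT_TO_REVOTE with the same term, and only WAIT_TO_VOTE_NEXT bumps the term before the next round.

// Hypothetical, simplified sketch of the improved vote-state handling.
public class VoteStateMachine {
    enum VoteState { WAIT_TO_REVOTE, WAIT_TO_VOTE_NEXT }

    private long term;
    private VoteState state = VoteState.WAIT_TO_REVOTE;

    // Called after counting the responses of one vote round.
    void onVoteRoundFinished(int validVotes, int votesGranted, int clusterSize) {
        int majority = clusterSize / 2 + 1;
        if (votesGranted >= majority) {
            becomeLeader();
        } else if (validVotes < majority) {
            // Likely isolated (e.g. a partitioned node): retry with the same term,
            // so that rejoining later does not force the healthy majority to re-elect.
            state = VoteState.WAIT_TO_REVOTE;
        } else {
            // Enough peers answered but the vote was lost; advance the term for the next round.
            state = VoteState.WAIT_TO_VOTE_NEXT;
            term++;
        }
    }

    private void becomeLeader() {
        // Transition to leader; omitted here.
    }
}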
(2) DLedger reliability test
DLedger also has very high fault tolerance. It can tolerate a variety of faults that prevent a node from working properly, such as:
● Abnormal process crash
● Abnormal machine crash (power loss, operating system crash)
● Slow nodes (Full GC, OOM, etc.)
● Network failures, various kinds of network partitions
To verify DLedger's tolerance of these faults, in addition to a variety of local tests, the distributed system verification and fault injection framework Jepsen was used to detect problems in DLedger and verify the reliability of the system.
The main purpose of the Jepsen framework is to verify whether a system maintains consistency under specific faults. A Jepsen test setup consists of six nodes: one control node (Control Node) and five DB nodes (DB Node). The control node can log in to the DB nodes over SSH; under its control, the distributed system is downloaded and deployed on the DB nodes to form the cluster under test. After the test starts, the control node creates a group of Worker processes, each with its own client of the distributed system. A Generator generates the operations each client performs, and the client processes apply these operations to the cluster under test. The start, the end and the result of each operation are recorded in a history. Meanwhile, a special client process, Nemesis, injects faults into the system. At the end of the test, a Checker analyzes whether the history is correct and consistent.
Given DLedger's positioning as a commitlog repository based on the raft protocol and an append-only log system, it is tested with Jepsen's Set model. A Set-model test is divided into two phases. In the first phase, different clients concurrently add different data to the cluster under test while faults are injected. In the second phase, a final read is issued to the cluster and the result set is obtained. Finally, the test verifies that every successfully added element is in the final result set, and that the final result set contains only elements whose addition was attempted.
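The final check of the Set model can be expressed compactly. The sketch below is in Java for illustration only (Jepsen itself is written in Clojure): no acknowledged element may be missing from the final read (lost-count = 0), and the final read may contain only elements whose addition was actually attempted (unexpected-count = 0).

import java.util.Set;

// Illustrative version of the Set-model consistency check.
public final class SetModelChecker {
    static boolean check(Set<Long> acknowledged, Set<Long> attempted, Set<Long> finalRead) {
        boolean noLoss = finalRead.containsAll(acknowledged);  // every acked add survived
        boolean noGhosts = attempted.containsAll(finalRead);   // nothing appeared out of thin air
        return noLoss && noGhosts;
    }
}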
The figure above shows one of DLedger's test results. Thirty client processes concurrently add data to the DLedger cluster under test, and random symmetric network partitions are injected along the way. The default fault injection interval is 30s, that is, 30s of normal operation, 30s of fault injection, repeating in a cycle; the whole run lasts 600s. In the end a total of 160,000 items of data were sent: no data was lost (lost-count=0), no data appeared that should not exist (unexpected-count=0), and the consistency test passed.
The figure above shows each operation performed by the clients against the DLedger cluster during the test. A small blue box indicates a successful add, a small red box indicates a failure, and a small yellow box indicates an uncertain result (for example, a majority-acknowledgement timeout); the grey areas mark the periods during which faults were injected. It is reasonable that some fault periods make the cluster temporarily unavailable while others do not: since the network isolation is random, it depends on whether the isolated nodes force the cluster into a re-election. Even when a re-election does occur, the DLedger cluster returns to availability after a short period.
In addition to symmetric network partition faults, we also tested DLedger's behavior under other faults, including randomly killing nodes, randomly pausing the processes of some nodes to simulate slow nodes, and complex asymmetric network partitions such as bridge and partition-majorities-ring. Under all of these faults, DLedger preserved consistency, verifying that it has good reliability.
IV. Future Development of DLedger
DLedger's next plans include:
● Leader node priority election
● Jepsen testing of RocketMQ on DLedger
● Runtime membership change
● Adding observers (which participate only in replication, not voting)
● Building highly available KV storage
● ...
DLedger is now a project under OpenMessaging, and community contributors are welcome to join us in building a highly available, high-performance commitlog repository.