This article explains the Jepsen testing method of DLedger. The explanation is straightforward and easy to follow; please read along to learn how DLedger is tested with Jepsen.
Challenges faced by distributed systems
Is it better to be alive and wrong, or right and dead? As computer technology has developed, system architecture has evolved from centralized to distributed. Distributed systems offer better scalability, better fault tolerance, and lower latency than a single machine, but running software on a distributed system is fundamentally different from running it on a single computer. One difference is that errors on a single computer are predictable: when there is no hardware failure, software running on a single machine always produces the same result, and if the hardware does fail, the outcome is usually the failure of the whole system. A single-machine system is therefore either fully functional and correct or completely broken, never somewhere in between.
Distributed systems are much more complex. Because they involve multiple nodes and a network, they suffer from partial failures. An unreliable network may drop packets or delay them arbitrarily, an unreliable clock may cause a node to drift out of sync with other nodes, and a process on a node may be suspended for a long time at any moment (for example, by the garbage collector) and be declared dead. All of this brings uncertainty and unpredictability to a distributed system. In fact, these problems are unavoidable in distributed systems: as the well-known CAP theorem points out, P (network partitioning) is always present, not optional.
Since faults are unavoidable in distributed systems, the simplest way to handle a failure is to take the whole service down and let the application "die correctly", but not every application can accept that. Failover tries to solve this problem by promoting one of the replicas to primary when a failure occurs, so that the new primary keeps providing service. But problems such as inconsistency between primary and replica data, or split brain, may make the application "live wrongly". In one incident at the code hosting site GitHub, an out-of-date MySQL replica was promoted to primary, which led to inconsistency between MySQL and Redis and ultimately to some private data being leaked to the wrong users. To reduce the impact of failures we need some means of ensuring data consistency, and how to verify that a large-scale distributed system remains correct and stable (reliable) under failure becomes a new problem.
Reliability verification
The reliability of a distributed system can be verified with formal specifications such as TLA+, but such verification requires a great deal of specialized theoretical knowledge. Another approach is verification through testing, but ordinary unit tests and integration tests cannot cover the edge cases that only appear under high concurrency or failure, which poses a new challenge for distributed system testing.
The emergence of chaos engineering has brought new ideas for verification: problems need to be found in the testing phase by deliberately causing faults, so that the fault-tolerance mechanisms are continuously exercised and tested, which raises confidence that the system can handle faults correctly when they occur naturally. Pavlos Ratis, who has an SRE background, maintains books, tools, papers, blogs, newsletters, conferences, forums, and Twitter accounts related to chaos engineering in his GitHub repository awesome-chaos-engineering. In addition, after fault injection we need not only to observe the availability of the system but also to ensure that the services the system provides are correct, that is, the system must still satisfy the expected consistency. Jepsen is currently regarded as the engineering best practice for consistency verification (the following figure shows the consistency models Jepsen can verify).
Jepsen can verify a system's consistency under specific faults. Over the past five years, Kyle Kingsbury has used it to test numerous early distributed systems, such as Redis, etcd, and ZooKeeper. As shown in the following figure, a Jepsen system consists of six nodes: one control node and five DB nodes. The control node can log in to the DB nodes over SSH, and through it the distributed system under test is deployed on the DB nodes to form a cluster. Once the test starts, the control node creates a group of worker processes, each containing a client of the system under test. A Generator process generates the operations each client performs and applies them to the system under test; the start and end of every operation, along with its result, are recorded in a history. At the same time, a special process, the Nemesis, introduces faults into the system. When the test ends, a Checker analyzes whether the history is correct and consistent.
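To make these roles concrete, here is a minimal, hypothetical sketch of the control loop in Java (Jepsen itself is written in Clojure, and none of the names below are Jepsen's actual API): a generator emits operations, a client applies them to the system under test while a nemesis injects a fault, every invocation and completion is appended to a history, and a checker analyzes that history afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative roles only: Jepsen itself is a Clojure framework, not this API.
interface Client { String apply(String op); }                    // runs one operation against the system under test
interface Nemesis { void injectFault(); void heal(); }           // introduces and repairs a fault
interface Checker { boolean consistent(List<String> history); }  // validates the recorded history

public class JepsenSketch {
    public static void main(String[] args) {
        List<String> history = new ArrayList<>();
        Supplier<String> generator = () -> "add " + System.nanoTime(); // generator: produces operations
        Client client = op -> op + " -> ok";                           // stand-in client: would call the real system
        Nemesis nemesis = new Nemesis() {
            public void injectFault() { history.add("nemesis: partition start"); }
            public void heal()        { history.add("nemesis: partition stop"); }
        };
        Checker checker = h -> true; // stand-in checker: e.g. the Set or total-queue model described below

        for (int i = 0; i < 10; i++) {
            if (i == 5) nemesis.injectFault();            // fault injected mid-run
            String op = generator.get();
            history.add("invoke: " + op);                 // record invocation
            history.add("complete: " + client.apply(op)); // record completion and result
        }
        nemesis.heal();
        System.out.println("history consistent? " + checker.consistent(history));
    }
}
```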
On the one hand, Jepsen provides means of fault injection and can simulate a variety of faults, such as network partitions, process crashes, and CPU overload. On the other hand, it provides various verification models, such as Set, Lock, and Queue, to check whether a distributed system still satisfies the expected consistency when faults occur. Through Jepsen testing we can uncover errors that a distributed system hides under extreme faults and thereby improve its fault tolerance. Jepsen testing has therefore been applied to the reliability testing of many distributed databases and distributed coordination services and has become an important means of verifying the consistency of distributed systems. Below, we take the raft-based commitlog store DLedger and the distributed message queue RocketMQ as examples to introduce the application of Jepsen testing to distributed messaging systems.
Jepsen testing of DLedger
DLedger is a raft-based Java library for building a commitlog with high availability, high durability, and strong consistency. As shown in the following figure, DLedger removes the state-machine part of the raft protocol and instead uses raft to guarantee that the commitlog itself is consistent and highly available.
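To give a feel for what an append-only, raft-replicated commitlog looks like to a user, the sketch below appends one entry and reads it back through DLedger's client. The class and method names (DLedgerClient, append, get) are taken from the DLedger repository as we recall it and should be treated as approximate rather than authoritative; the group name and peer addresses are made-up assumptions.

```java
// Class and package names assumed from the openmessaging/dledger repository; verify against your DLedger version.
import io.openmessaging.storage.dledger.client.DLedgerClient;
import io.openmessaging.storage.dledger.protocol.AppendEntryResponse;
import io.openmessaging.storage.dledger.protocol.GetEntriesResponse;

public class DLedgerAppendSketch {
    public static void main(String[] args) {
        // Group name and peer list are assumed to match a locally started 3-node DLedger cluster.
        DLedgerClient client = new DLedgerClient("g0",
                "n0-localhost:20911;n1-localhost:20912;n2-localhost:20913");
        client.startup();

        // The append is routed to the raft leader and acknowledged once a quorum has replicated it.
        AppendEntryResponse append = client.append("hello dledger".getBytes());
        System.out.println("appended at index " + append.getIndex());

        // Read the committed entry back from the log by index.
        GetEntriesResponse get = client.get(append.getIndex());
        System.out.println("entries read: " + get.getEntries().size());

        client.shutdown();
    }
}
```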
Before running Jepsen tests on DLedger, we first need to be clear about what kind of consistency DLedger has to satisfy. In Jepsen testing, many raft-based distributed applications use linearizability to verify the system. Linearizability is one of the strongest consistency models; a system that satisfies it can provide services with uniqueness constraints, such as distributed locks and leader election. However, given DLedger's positioning as an append-only log system, it does not need such strict consistency, and eventual consistency of the data better matches our correctness requirements for DLedger under failure. We therefore use Jepsen's Set test to check DLedger's consistency under various faults.
The Set test, shown in the following figure, is divided into two phases. In the first phase, different clients concurrently add different data to the cluster under test while faults are injected. In the second phase, a final read is performed against the cluster to obtain the result set. Finally, we verify that every successfully added element is in the final result set and that the final result set contains only elements that were attempted.
In the actual test, we start 30 client processes that concurrently add consecutive, non-repeating numbers to the DLedger cluster under test while specific faults, such as asymmetric network partitions and random node kills, are introduced. Faults are injected on a 30 s cycle, that is, 30 s of normal operation alternates with 30 s of fault injection, and the whole phase lasts 600 s in total. When the concurrent write phase ends, the final read is performed and the result set is obtained and verified.
In terms of fault injection, we test the following fault types:
partition-random-node and partition-random-halves simulate common symmetric network partitions.
kill-random-processes and crash-random-nodes simulate process crashes and node crashes.
hammer-time simulates slow nodes, for example nodes suffering from full GC or OOM.
bridge and partition-majorities-ring simulate more extreme asymmetric network partitions.
We take the random network partition fault partition-random-halves as an example to analyze the test results. After the test is complete, the results shown in the following figure appear in the log:
You can see that during the test the 30 clients sent a total of 167354 items (attempt-count), 167108 adds were acknowledged as successful (acknowledged-count), and 167113 items were actually added successfully (ok-count). The remaining 5 items could not be confirmed as added because the request or the quorum acknowledgment timed out, yet they appear in the final read result set (recovered-count). Because lost-count = 0 and unexpected-count = 0, the final consistency verification passes.
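The counts above follow from a simple comparison of three sets recorded during the test: the elements that were attempted, the elements whose add was acknowledged, and the elements seen in the final read. The sketch below is an illustrative Java version of that bookkeeping, not Jepsen's actual Clojure checker:

```java
import java.util.HashSet;
import java.util.Set;

public class SetCheckSketch {
    // Derives the counts reported above from three sets recorded during the test.
    static boolean check(Set<Long> attempted, Set<Long> acknowledged, Set<Long> finalRead) {
        // ok: elements that actually ended up in the final read (restricted to what was attempted)
        Set<Long> ok = new HashSet<>(finalRead);
        ok.retainAll(attempted);

        // recovered: indeterminate adds (timed out, never acknowledged) that nevertheless appear in the final read
        Set<Long> recovered = new HashSet<>(ok);
        recovered.removeAll(acknowledged);

        // lost: acknowledged as added but missing from the final read -- must be empty for the test to pass
        Set<Long> lost = new HashSet<>(acknowledged);
        lost.removeAll(finalRead);

        // unexpected: present in the final read although never attempted -- must also be empty
        Set<Long> unexpected = new HashSet<>(finalRead);
        unexpected.removeAll(attempted);

        System.out.printf("attempt=%d acknowledged=%d ok=%d recovered=%d lost=%d unexpected=%d%n",
                attempted.size(), acknowledged.size(), ok.size(),
                recovered.size(), lost.size(), unexpected.size());
        return lost.isEmpty() && unexpected.isEmpty();
    }
}
```

Passing requires both lost and unexpected to be empty, which is exactly the lost-count = 0 and unexpected-count = 0 condition reported in the log.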
To analyze the performance of the DLedger cluster during the test more clearly, we present it as charts. The latency of each client operation against the DLedger cluster is shown in the following figure. The blue boxes indicate successful adds, the red boxes indicate failed adds, the yellow boxes indicate adds whose outcome is uncertain, and the grey regions mark the fault-injection periods. It is reasonable that some fault-injection periods cause temporary unavailability of the cluster while others do not: because the partition is random, the cluster re-elects a leader only when the current leader is isolated on the minority side, and even when a re-election happens, the DLedger cluster restores availability within a short time. We can also see that, thanks to DLedger's fault-tolerant design for symmetric network partitions, the cluster does not hold a new election after every fault recovery.
The following figure shows the latency percentiles of DLedger during the test.
You can see that, apart from the periods in which the cluster re-elects a leader after certain faults are introduced and latency rises, the performance of the DLedger cluster is stable in all other periods: 95% of add operations complete in under 5 ms, and 99% in under 10 ms. Under the fault injection of random symmetric network partitions, DLedger's performance is stable and meets expectations.
Besides random symmetric network partitions, DLedger also passed the Set-test consistency verification under the other five kinds of fault injection, demonstrating its tolerance of network partitions, process crashes, node crashes, and similar faults.
Jepsen testing of RocketMQ
Apache RocketMQ is a distributed message queue with low latency, high performance, high reliability, and flexible scalability. Since version 4.5.0, RocketMQ has supported DLedger deployment, which gives a single broker group the ability to fail over and thus better availability and reliability. We now use Jepsen to test the fault tolerance of RocketMQ's DLedger deployment mode.
First of all, we again need to be clear about what kind of consistency RocketMQ has to satisfy when faults occur. Jepsen provides a total-queue test for distributed systems, which requires that every item successfully enqueued must eventually be dequeued, that is, message delivery must be at-least-once. This matches our correctness requirements for RocketMQ under failure, so we use the total-queue test for RocketMQ's Jepsen testing.
The total-queue test, shown in the following figure, is divided into two phases. In the first phase, client processes concurrently issue random enqueue and dequeue operations against the cluster, each kind accounting for about half of the operations, while faults are injected. In the second phase, to make sure that every item gets a chance to be dequeued, the client processes invoke the drain operation to empty the queue.
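As with the Set test, the pass condition can be written down as a small bookkeeping exercise: every acknowledged enqueue must show up among the dequeued values (including those returned by the final drain), while duplicates do not violate at-least-once. The following Java sketch is illustrative only, not Jepsen's actual checker:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TotalQueueCheckSketch {
    // Checks at-least-once delivery over acknowledged enqueues and every dequeued value (drain included).
    static boolean check(Set<Long> acknowledgedEnqueues, List<Long> dequeued) {
        Map<Long, Integer> seenCounts = new HashMap<>();
        for (Long value : dequeued) {
            seenCounts.merge(value, 1, Integer::sum); // count how many times each value was dequeued
        }

        // lost: acknowledged as enqueued but never dequeued -- violates at-least-once, so the test fails
        Set<Long> lost = new HashSet<>(acknowledgedEnqueues);
        lost.removeAll(seenCounts.keySet());

        // duplicates do not violate at-least-once; they are only reported for information
        long duplicated = seenCounts.values().stream().filter(count -> count > 1).count();

        System.out.println("lost=" + lost.size() + " duplicated=" + duplicated);
        return lost.isEmpty();
    }
}
```

Passing requires the lost set to be empty, which matches the at-least-once requirement described above.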
In the actual test, we start four client processes that concurrently enqueue to and dequeue from the RocketMQ cluster under test while specific faults are introduced. The fault-injection interval is 200 s, and the whole phase lasts one hour in total. At the end of the first phase, the clients perform the drain operation to empty the queue.
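The enqueue and dequeue operations issued by the test clients map onto RocketMQ's ordinary producer and consumer APIs. The snippet below is only a rough sketch of that mapping under assumed names (the topic, group names, and nameserver address are invented for illustration); it is not the actual Jepsen client code:

```java
import org.apache.rocketmq.client.consumer.DefaultMQPushConsumer;
import org.apache.rocketmq.client.consumer.listener.ConsumeConcurrentlyStatus;
import org.apache.rocketmq.client.consumer.listener.MessageListenerConcurrently;
import org.apache.rocketmq.client.producer.DefaultMQProducer;
import org.apache.rocketmq.common.message.Message;

public class JepsenClientMappingSketch {
    public static void main(String[] args) throws Exception {
        // "enqueue": send a message carrying one test value to an assumed topic
        DefaultMQProducer producer = new DefaultMQProducer("jepsen_producer_group"); // group name is illustrative
        producer.setNamesrvAddr("localhost:9876"); // assumed nameserver address
        producer.start();
        producer.send(new Message("jepsen_test_topic", Long.toString(42L).getBytes()));

        // "dequeue": consume messages from the same topic and record the values that come back
        DefaultMQPushConsumer consumer = new DefaultMQPushConsumer("jepsen_consumer_group");
        consumer.setNamesrvAddr("localhost:9876");
        consumer.subscribe("jepsen_test_topic", "*");
        consumer.registerMessageListener((MessageListenerConcurrently) (msgs, context) -> {
            msgs.forEach(msg -> System.out.println("dequeued: " + new String(msg.getBody())));
            return ConsumeConcurrentlyStatus.CONSUME_SUCCESS;
        });
        consumer.start();
    }
}
```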
We again run the six kinds of fault injection described above and take the random node-kill fault as an example to analyze the results (to ensure that killing nodes does not make the entire cluster unavailable, the test code kills only a minority of the nodes at a time). After the test completes, the results shown in the following figure appear:
You can see that during the test the 30 clients attempted to enqueue a total of 65947 items (attempt-count), 64390 enqueues were acknowledged as successful (acknowledged-count), and 64390 items were in fact successfully enqueued (ok-count); no duplicate items were dequeued, so the consistency verification under failure passes.
We use charts to analyze RocketMQ's performance under faults more clearly. The following figure shows the latency of each client operation against the RocketMQ cluster.
A small red triangle indicates a failed enqueue; a period with a large number of red triangles means the system was unavailable during that time. The figure shows some unavailable periods at the start of fault injection (the grey regions); these are caused by the fault triggering a cluster re-election, and availability is restored after a while. However, there is also a period of unavailability after the fault has been healed, which does not match expectations.
Log investigation showed that the window of unavailability after fault recovery was almost always about 30 seconds, which is the registration interval between the broker and the nameserver. Further investigation found that the routing information of the master broker in the nameserver was lost during this window. The cause was that after the fault was healed, the killed broker process restarted with the default brokerId of 0, and it registered with the nameserver before its brokerId had been updated, thereby overwriting the routing information of the real master broker and leaving the cluster unavailable for that window.
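The sketch below is a hypothetical reconstruction of that race, not RocketMQ's actual code: the nameserver keeps a route entry per brokerId for a broker group (brokerId 0 being the master), a freshly restarted process still carrying the default brokerId 0 can overwrite the elected master's address, and the guarded version defers registering as master until the DLedger election has actually made this node the leader.

```java
import java.util.HashMap;
import java.util.Map;

public class BrokerRegistrationSketch {
    // Hypothetical route table: brokerId -> address for one broker group (brokerId 0 = master).
    static final Map<Long, String> routeTable = new HashMap<>();

    // Before the fix (conceptually): a restarted broker still holding the default brokerId 0
    // registers immediately and silently replaces the elected master's address.
    static void registerUnguarded(long brokerId, String addr) {
        routeTable.put(brokerId, addr);
    }

    // After the fix (conceptually): registration as master is deferred until the DLedger
    // election has actually assigned this node the leader role.
    static void registerGuarded(long brokerId, String addr, boolean isDLedgerLeader) {
        if (brokerId == 0 && !isDLedgerLeader) {
            return; // not the leader yet: do not overwrite the current master's route
        }
        routeTable.put(brokerId, addr);
    }

    public static void main(String[] args) {
        registerUnguarded(0, "electedMaster:30911");      // real master registered after election
        registerUnguarded(0, "restartedFollower:30912");  // restarted node overwrites it -> clients lose the master
        System.out.println("unguarded master route: " + routeTable.get(0L));

        routeTable.clear();
        registerGuarded(0, "electedMaster:30911", true);
        registerGuarded(0, "restartedFollower:30912", false); // rejected: route stays correct
        System.out.println("guarded master route: " + routeTable.get(0L));
    }
}
```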
After fixing this problem, we re-ran the Jepsen test; the latency chart of the retest is shown in the following figure. The retest results show that the problem is fixed and there is no longer an unavailable window after fault recovery. Through Jepsen testing we found an availability problem in RocketMQ's DLedger deployment mode under fault injection, optimized the code, and contributed the fix to the RocketMQ community. We also tested RocketMQ under the other kinds of fault injection, and it passed the total-queue consistency verification in every case.
Some thoughts on Jepsen testing
Taking DLedger and RocketMQ as examples, we have used Jepsen to verify the consistency of distributed messaging systems under failure. In the process, we also found some shortcomings of the Jepsen framework itself.
Jepsen tests cannot run for long periods. A long-running Jepsen test produces a large amount of data, which causes OOM in the verification phase; yet in real scenarios many hidden bugs can only be exposed by long stress tests and fault simulations, and the stability of a system also needs a long time to be demonstrated.
The models provided by Jepsen do not fully cover specific domains. For example, in distributed messaging, Jepsen only provides the queue and total-queue tests to verify whether messages are lost or duplicated under failure; properties that matter for distributed message queues, such as partition ordering, global ordering, and the correctness of the rebalancing algorithm, are not covered.
The OpenMessaging community, which works on distributed messaging standards, is also trying to address these problems in order to provide more complete reliability verification for the messaging field. DLedger's Jepsen test code has been placed in OpenMessaging's openmessaging-dledger-jepsen repository and provides a Docker startup mode, which makes it easy to run the tests quickly on a single machine.
The cost of errors caused by failures is very high: an outage of an e-commerce website leads to huge losses of revenue and reputation, and when a system provided by a cloud vendor goes down or malfunctions, it exacts a heavy price from its users and from their users in turn. How can a distributed system keep functioning correctly and deliver the expected level of performance under adversity (hardware faults, software faults, human error)? This cannot be solved by algorithm design and code implementation alone; we also need distributed-system testing tools to simulate all kinds of faults in advance, uncover deep problems from those failures, and improve the system's fault tolerance. Only then can losses be minimized when an accident actually occurs.
Thank you for reading. The above covers the Jepsen testing method of DLedger. After studying this article, you should have a deeper understanding of how DLedger is tested with Jepsen; the specific usage still needs to be verified in practice. The editor will continue to share more related articles, and you are welcome to follow.