Practical Analysis of the Jepsen Testing Framework in the Graph Database Nebula Graph

2025-07-06 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article analyzes in detail how the Jepsen testing framework is applied to the graph database Nebula Graph, covering the test setup, the test models, and the results, in the hope of giving readers who face the same task a simple and workable approach.

Introduction to Jepsen

Jepsen is an open source software library for testing systems, intended to improve the safety of distributed databases, queues, consensus systems, and so on. Its author, Kyle Kingsbury, wrote the framework in the functional programming language Clojure and has used it to test the consistency of several well-known distributed systems and databases. Jepsen is still actively maintained on GitHub, and passing Jepsen tests has become a benchmark that distributed databases use to test themselves.

The testing process of Jepsen

Docker is the recommended way to build a cluster for Jepsen testing. By default, the cluster consists of six containers: one control node and five database nodes (named n1-n5 by default). After the test program starts, the control node launches multiple worker processes, which log in to the database nodes concurrently over SSH to perform read and write operations.

After the test begins, the control node creates a set of processes that contain the clients of the distributed system under test. A separate Generator process generates the operations each client performs, and the clients apply them to the system under test. The start and end of each operation, together with its result, are recorded in the history. Meanwhile, a special process, Nemesis, introduces faults into the system.
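To make the idea of a history concrete, here is a minimal Python sketch (not Jepsen's actual Clojure implementation). Worker threads stand in for client processes, an in-memory register stands in for the database, and every operation is logged once when it is invoked and once when it completes. The names `History`, `worker`, and `run_workers` are invented for this illustration.

```python
import threading


class History:
    """Thread-safe, append-only log of operation events."""

    def __init__(self):
        self._lock = threading.Lock()
        self.events = []

    def record(self, process, type_, f, value):
        with self._lock:
            self.events.append(
                {"process": process, "type": type_, "f": f, "value": value}
            )


def worker(process, ops, register, history):
    """Apply each op to the shared register, logging its start and its end."""
    for f, value in ops:
        history.record(process, "invoke", f, value)
        with register["lock"]:
            if f == "write":
                register["value"] = value
            result = register["value"]
        history.record(process, "ok", f, result)


def run_workers(ops_by_process):
    """Run one thread per process and return the recorded history."""
    history = History()
    register = {"lock": threading.Lock(), "value": 0}
    threads = [
        threading.Thread(target=worker, args=(p, ops, register, history))
        for p, ops in ops_by_process.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return history.events
```

Because the threads run concurrently, events from different processes interleave in the history, but each process always produces an invoke entry followed by its completion.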

At the end of the test, the Checker analyzes whether the history is correct and consistent. Users can use the verification models provided by Jepsen's Knossos library, or define their own model to verify the test results. Faults can also be injected during the test to disturb the cluster.

Finally, the results are analyzed according to the verification model specified in this test.

How to use Jepsen

You may run into some problems while using Jepsen. The following tips may help:

In the Jepsen framework, users define how to download, install, start, and terminate their database in the DB interface. After termination, the log files can be cleared, or their storage location can be specified so that Jepsen copies them into its own log folder for analysis alongside Jepsen's own logs.
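The lifecycle described above can be sketched in Python as follows. This is only an analogy: Jepsen's real DB interface is a Clojure protocol, and the class names, the `run_lifecycle` helper, and the log path below are all hypothetical.

```python
from abc import ABC, abstractmethod


class DB(ABC):
    """Hypothetical Python analogue of a database lifecycle interface."""

    @abstractmethod
    def setup(self, node):
        """Download, install, and start the database on a node."""

    @abstractmethod
    def teardown(self, node):
        """Stop the database on a node and clean up."""

    def log_files(self, node):
        """Paths of log files to collect from the node (none by default)."""
        return []


class FakeDB(DB):
    """In-memory stand-in that only tracks which nodes are 'running'."""

    def __init__(self):
        self.running = set()

    def setup(self, node):
        self.running.add(node)

    def teardown(self, node):
        self.running.discard(node)

    def log_files(self, node):
        return [f"/tmp/fake-db/{node}.log"]  # hypothetical path


def run_lifecycle(db, nodes, body):
    """Set up every node, run the test body, collect log paths, tear down."""
    for node in nodes:
        db.setup(node)
    try:
        body()
    finally:
        logs = {node: db.log_files(node) for node in nodes}
        for node in nodes:
            db.teardown(node)
    return logs
```

Collecting the log paths before teardown mirrors the point made above: the logs have to be gathered for analysis before the node is cleaned up.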

Users also need to provide a client for accessing their database. The client can be anything usable from Clojure, such as verschlimmbesserung for etcd, JDBC, and so on. They then implement the Client interface to tell Jepsen how to operate on the database.
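As a rough illustration of what such a client does, the Python sketch below takes an operation, applies it to a store, and returns the completed operation with a type of ok, fail, or info (outcome unknown). The `InMemoryKV` store, the `KVTimeout` exception, and the `RegisterClient` class are assumptions made for this example; they are not the Nebula Graph client or Jepsen's Clojure protocol.

```python
class KVTimeout(Exception):
    """Raised by a store when the outcome of a request is unknown."""


class InMemoryKV:
    """Toy key-value store standing in for the database under test."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value

    def cas(self, key, old, new):
        if self.data.get(key) != old:
            return False
        self.data[key] = new
        return True


class RegisterClient:
    """Turns an invoked operation into a completed one."""

    def __init__(self, store, key):
        self.store = store
        self.key = key

    def invoke(self, op):
        try:
            if op["f"] == "read":
                return {**op, "type": "ok", "value": self.store.get(self.key)}
            if op["f"] == "write":
                self.store.put(self.key, op["value"])
                return {**op, "type": "ok"}
            if op["f"] == "cas":
                old, new = op["value"]
                swapped = self.store.cas(self.key, old, new)
                return {**op, "type": "ok" if swapped else "fail"}
            raise ValueError(f"unknown operation: {op['f']}")
        except KVTimeout:
            # We cannot tell whether the operation took effect.
            return {**op, "type": "info", "error": "timeout"}
```

Distinguishing a definite failure (fail) from an unknown outcome (info) matters for the later analysis, because only a definite failure can safely be treated as an operation that never happened.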

In the Checker, you choose the test models you want. For example, the performance checker (checker/perf) generates charts of latency over the entire test, and the timeline checker (timeline/html) generates an HTML page that records the timeline of all operations.

Another indispensable component is Nemesis, which injects the faults you want to test. Network partitioning (nemesis/partition-random-halves) and killing data nodes (kill-node) are common fault injections.

In the Generator, users tell the worker processes which operations to generate, the interval between operations, the interval between fault injections, and so on.
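The following Python sketch shows one way to think about what a generator specifies: a stream of random client operations at a fixed interval, interleaved with alternating nemesis start and stop events at a longer interval. The function name `generate_schedule` and its parameters are hypothetical, and Jepsen's own generators are composed in Clojure rather than precomputed like this.

```python
import random


def generate_schedule(duration_s, op_interval_s, nemesis_interval_s, seed=0):
    """Return a time-ordered list of (time, actor, op) tuples."""
    rng = random.Random(seed)
    schedule = []

    # Client operations: a random read or write every op_interval_s seconds.
    t = 0.0
    while t < duration_s:
        if rng.random() < 0.5:
            op = {"f": "read", "value": None}
        else:
            op = {"f": "write", "value": rng.randint(0, 4)}
        schedule.append((t, "client", op))
        t += op_interval_s

    # Nemesis operations: alternate start and stop every nemesis_interval_s.
    t = nemesis_interval_s
    fault_active = False
    while t < duration_s:
        f = "stop" if fault_active else "start"
        schedule.append((t, "nemesis", {"f": f, "value": None}))
        fault_active = not fault_active
        t += nemesis_interval_s

    # The sort is stable, so a client op at the same instant stays first.
    schedule.sort(key=lambda entry: entry[0])
    return schedule
```

For example, `generate_schedule(30, 1.0, 10.0)` yields thirty client operations plus a nemesis start at 10 seconds and a stop at 20 seconds.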

Using Jepsen to Test the Graph Database Nebula Graph

The distributed graph database Nebula Graph consists mainly of three parts: the meta layer, the graph layer, and the storage layer.

To test the kv storage interface with Jepsen, we built a cluster of eight containers, started with Docker-compose: one Jepsen control node, one meta node, one graph node, and five storage nodes. Note that the cluster needs a subnet so that the containers can reach one another, the SSH service must be installed, and passwordless SSH login must be configured between the control node and the five storage nodes.

In the test, a client program written in Java is packaged as a jar and added as a dependency of the Clojure program, which uses it to perform put, get, and cas (compare-and-set) operations on the DB. In addition, the Nebula Graph client has automatic retry logic: when an error causes an operation to fail, the client retries as appropriate so that the operation can succeed.

The Nebula-Jepsen test program currently covers three common test models and three common fault injections.

Jepsen test models

Single-register

This model simulates a register. The program reads and writes the database concurrently, and each successful write changes the value stored in the register. The result is then checked for linearizability by comparing the values read from the database against the values recorded in the register. Because there is only a single register, we use one unique key and random values for the operations.
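To illustrate what checking linearizability of a single register means, here is a small brute-force sketch in Python. It is not the algorithm Knossos uses. It assumes every operation completed successfully and carries hypothetical `start` and `end` timestamps, and it searches for some order of operations that respects real time (an operation that finished before another began must come first) and in which every read returns the latest written value.

```python
def is_linearizable(ops, initial=None):
    """Brute-force linearizability check for one read/write register.

    Each op is a dict with keys: f ("read" or "write"), value, start, end.
    """
    n = len(ops)

    def search(done, value):
        if len(done) == n:
            return True
        pending = [i for i in range(n) if i not in done]
        # An op may go next only if no pending op finished before it began.
        earliest_end = min(ops[i]["end"] for i in pending)
        for i in pending:
            op = ops[i]
            if op["start"] > earliest_end:
                continue
            if op["f"] == "write":
                if search(done | {i}, op["value"]):
                    return True
            elif op["value"] == value:  # a read must see the current value
                if search(done | {i}, value):
                    return True
        return False

    return search(frozenset(), initial)
```

A read that overlaps a write may return either the old or the new value, but a read that starts after two sequential writes have finished must return the second one. The search is exponential in the worst case, which is acceptable only for tiny histories like the ones in this sketch.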

Multi-register

This model is a register that can store different keys. It works the same way as the single register, except that the keys are randomly generated as well.

4 :invoke :write [[:w 9 1]]
4 :ok :write [[:w 9 1]]
3 :invoke :read [[:r 5 nil]]
3 :ok :read [[:r 5 3]]
0 :invoke :read [[:r 7 nil]]
0 :ok :read [[:r 7 2]]
0 :invoke :write [[:w 7 1]]
0 :ok :write [[:w 7 1]]
1 :invoke :read [[:r 1 nil]]
1 :ok :read [[:r 1 4]]
0 :invoke :read [[:r 8 nil]]
0 :ok :read [[:r 8 3]]
:nemesis :info :start nil
:nemesis :info :start [:isolated {"N5" #{"N2" "N1" "N4" "N3"}, "N2" #{"N5"}, "N1" #{"N5"}, "N4" #{"N5"}, "N3" #{"N5"}}]
1 :invoke :write [[:w 4 2]]
1 :ok :write [[:w 4 2]]
2 :invoke :read [[:r 5 nil]]
3 :invoke :write [[:w 1 2]]
2 :ok :read [[:r 5 3]]
3 :ok :write [[:w 1 2]]
0 :invoke :read [[:r 4 nil]]
0 :ok :read [[:r 4 2]]
1 :invoke :write [[:w 6 4]]
1 :ok :write [[:w 6 4]]

The snippet above is an excerpt showing a few of the read and write operations in the test.

The leftmost number is the worker, that is, the number of the process that performed the operation. When an operation is initiated its type is invoke. The next column indicates whether it is a write or a read, and the column after that shows the specific operation in square brackets. For example:

:invoke :read [[:r 1 nil]] reads the value whose key is 1. Because the type is invoke, the operation has only just started and the value is not yet known, so the key is followed by nil.

The ok in :ok :read [[:r 1 4]] indicates that the operation succeeded. The value read for key 1 is 4.

This excerpt also shows a moment at which Nemesis injects a fault.

:nemesis :info :start nil marks the start of the nemesis operation, and the content that follows ([:isolated ...]) indicates that node N5 has been isolated from the rest of the cluster and cannot communicate with the other DB nodes.
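For readers who want to post-process such a history themselves, the columns described above can be pulled apart with a few lines of Python. The `parse_op` function is a hypothetical helper written for this article, not part of Jepsen. It handles only client operations carrying a single read or write, and it tolerates stray spaces around the colons.

```python
def parse_op(line):
    """Parse one client line such as '1 :ok :read [[:r 1 4]]' into a dict."""
    # Drop colons and brackets so that only the six fields remain.
    cleaned = line.replace(":", " ").replace("[", " ").replace("]", " ")
    process, type_, f, kind, key, value = cleaned.split()
    return {
        "process": int(process),
        "type": type_,            # invoke, ok, fail, or info
        "f": f,                   # read or write
        "kind": kind.lower(),     # r or w
        "key": int(key),
        "value": None if value == "nil" else int(value),
    }
```

Nemesis lines have a different shape and would need separate handling.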

Cas-register

This register model verifies CAS operations. In addition to reads and writes, this test adds randomly generated CAS operations, and cas-register performs a linearizability analysis of the results.

0 :invoke :read nil
0 :ok :read 0
1 :invoke :cas [0 2]
1 :ok :cas [0 2]
4 :invoke :read nil
4 :ok :read 2
0 :invoke :read nil
0 :ok :read 2
2 :invoke :write 0
2 :ok :write 0
3 :invoke :cas [2 2]
:nemesis :info :start nil
0 :invoke :read nil
0 :ok :read 0
1 :invoke :cas [1 3]
:nemesis :info :start {"N1"}
3 :fail :cas [2 2]
1 :fail :cas [1 3]
4 :invoke :read nil
4 :ok :read 0

As before, this test uses a single unique key. For example, all reads and writes are performed on the key "f". The display omits the key in square brackets and shows only the value.

:invoke :read nil starts an operation that reads the value of "f". Because the operation has only just started, the result is nil (empty).

:ok :read 0 indicates that the key "f" was read successfully, with a value of 0.

:invoke :cas [1 2] performs a CAS operation that changes the value to 2 if the value read is 1.

The second line shows that the stored value is 0, so cas [0 2] on line 4 changes the value to 2. By line 14 the value is 0 again, so cas [2 2] fails on line 17.

Line 16 shows the N1 node being killed, and lines 17 and 18 show two cas failures (fail).
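The behavior of a CAS register can be written down as a small state machine. The Python sketch below is a simplified stand-in for such a model and is not Knossos code. It replays completed operations one after another in completion order, ignoring concurrency, and reports whether every result is consistent with the register's value. Failed operations are skipped because they are known not to have taken effect. The initial value of 0 is an assumption taken from the first read in the excerpt.

```python
def step(value, op):
    """Apply one successful op; return (new_value, is_consistent)."""
    f, arg = op["f"], op["value"]
    if f == "write":
        return arg, True
    if f == "read":
        return value, arg == value
    if f == "cas":
        old, new = arg
        if value == old:
            return new, True
        return value, False  # a successful CAS needs a matching old value
    raise ValueError(f"unknown operation: {f}")


def replay(initial, completed_ops):
    """Return True if the ops, applied in order, are consistent."""
    value = initial
    for op in completed_ops:
        if op["type"] == "fail":
            continue  # a failed op did not change the register
        value, consistent = step(value, op)
        if not consistent:
            return False
    return True
```

Replaying the completed operations of the excerpt (read 0, cas [0 2], read 2, read 2, write 0, read 0, the two failed cas operations, read 0) from an initial value of 0 returns True, whereas a history in which a read returns 1 right after a write of 0 returns False.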

Jepsen fault injection

Kill-node

Throughout the test, the Jepsen control node repeatedly picks a random node and kills its database service, stopping the service. The cluster is then short one node. After a certain period of time, the database service on that node is restarted and rejoins the cluster.

Partition-random-node

During the test, Jepsen repeatedly isolates a random node from the network of the other nodes, so that it cannot communicate with them and they cannot communicate with it. After a certain period of time the isolation is lifted, restoring the cluster to its original state.

Partition-random-halves

In this common network partitioning scenario, the Jepsen control node randomly divides the five DB nodes into two groups, one with two nodes and the other with three. Communication is restored after a certain period of time. This is shown in the following figure.
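Both partition schemes come down to computing, for each node, the set of nodes it should stop talking to. This is the same shape as the :isolated map in the history excerpt earlier. The Python sketch below illustrates that computation only. The function names are hypothetical, and actually cutting the network (for example with firewall rules) is outside its scope.

```python
import random


def build_grudge(side_a, side_b):
    """Map every node to the set of nodes on the other side of the cut."""
    grudge = {}
    for node in side_a:
        grudge[node] = set(side_b)
    for node in side_b:
        grudge[node] = set(side_a)
    return grudge


def isolate_random_node(nodes, rng):
    """Cut one randomly chosen node off from all the others."""
    victim = rng.choice(list(nodes))
    others = [n for n in nodes if n != victim]
    return build_grudge([victim], others)


def random_halves(nodes, rng):
    """Shuffle the nodes and split them into a smaller and a larger group."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return build_grudge(shuffled[:mid], shuffled[mid:])
```

With five nodes, `random_halves` always produces one group of two and one group of three, so exactly one side can still hold a majority.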

After the test is over

Jepsen analyzes the history as required and produces the test results. The console output below shows that this test passed.

2020-01-08 03:24:51,742{GMT} INFO [jepsen test runner] jepsen.core: {:timeline {:valid? true}, :linear {:valid? true, :configs ({:model {:value 0}, :last-op {:process 0, :type :ok, :f :write, :value 0, :index 597, :time 60143184600}, :pending []}), :analyzer :linear, :final-paths ()}, :valid? true}

Everything looks good! ヽ(‘ー`)ノ

The timeline.html file automatically generated by Jepsen

Jepsen automatically generates a file called timeline.html during test execution. The following is a partial screenshot of the timeline.html file generated in this exercise.

The picture above shows a segment of the timeline of operations performed in the test. Each execution block carries its corresponding execution information, and Jepsen generates an HTML file for the entire timeline.

Jepsen verifies linearizability against this sequential history of operations, which is the core of Jepsen. We can also use this HTML file to help trace the source of an error.

Performance analysis diagrams generated by Jepsen

Here are some performance analysis charts generated by Jepsen. The project in this exercise is called "basic-test". When you follow along, substitute the name of your own project.

This chart shows the read and write latency of Nebula Graph. The gray areas mark the periods of fault injection. In this test we injected random kill-node faults.

In this chart of the success rate of read and write operations, the red points at the bottom highlight failures. They occur because the control node, when killing a node, terminated the nebula service on the leader of a partition. The cluster then has to elect a new leader, and once the new leader is elected, read and write operations return to normal.

From the output of the test program and the charts, we can see that Nebula Graph passed the test of injecting kill-node faults under the single-register model, and that read and write latency stayed within the normal range.

Jepsen itself has some shortcomings. For example, a test cannot run for a long time, because a large amount of history data can cause an Out of Memory error during the verification phase.

In real scenarios, however, many bugs are found only after long periods of stress testing and fault simulation, and the stability of a system also takes a long time to verify. Even so, while testing Nebula Graph with Jepsen we found some bugs we had not encountered before, some of which might never appear in normal use.

We now use Jepsen to test Nebula Graph as part of the daily development process. Whenever Nebula Graph's code is updated, the compiled project is published to Docker Hub every night, and Nebula-Jepsen automatically pulls the latest image for continuous testing.
