Nebula Architecture Analysis Series (1) Storage Design of Graph Database


Abstract

When discussing a database, its storage and query engines are usually the hot topics, and they are an indispensable part of any enthusiast's understanding of the system. Each database has its own approach to storage and computation. In this article, we will walk through the storage part of Nebula Graph with the help of diagrams.

The storage in Nebula consists of two parts: the meta-related storage, which we call the Meta Service, and the data-related storage, which we call the Storage Service. The two services run as separate processes, their data is completely isolated, and they are deployed independently, although their overall architectures are largely the same, as mentioned at the end of this article. Unless otherwise specified, Storage Service in this article refers to the data storage service. Now let's take a look at the overall architecture of the Storage Service. Let's go~

Architecture

Figure 1: Storage Service architecture diagram

As shown in Figure 1, the Storage Service has three layers. The bottom layer is the Store Engine, a single-node local storage engine that provides get / put / scan / delete operations on local data; the related interfaces are defined in KVStore / KVEngine.h, and users can develop their own local store plugins to suit their needs. Currently, Nebula provides a Store Engine based on RocksDB.
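Below is a minimal sketch of what such a local store plugin interface could look like, assuming hypothetical type names (ResultCode, KVIterator, KVEngine); the actual definitions in KVStore / KVEngine.h may differ.

```cpp
#include <memory>
#include <string>

// Result codes a local store engine might return (illustrative only).
enum class ResultCode { SUCCEEDED, ERR_KEY_NOT_FOUND, ERR_IO };

// Iterator over a key range or prefix.
class KVIterator {
 public:
  virtual ~KVIterator() = default;
  virtual bool valid() const = 0;
  virtual void next() = 0;
  virtual std::string key() const = 0;
  virtual std::string val() const = 0;
};

// The surface a pluggable local store engine (for example the RocksDB-backed
// one) needs to provide: simple get/put/remove plus prefix scans.
class KVEngine {
 public:
  virtual ~KVEngine() = default;
  virtual ResultCode get(const std::string& key, std::string* value) = 0;
  virtual ResultCode put(std::string key, std::string value) = 0;
  virtual ResultCode remove(const std::string& key) = 0;
  // Prefix scan is the workhorse for the graph layer above.
  virtual ResultCode prefix(const std::string& prefix,
                            std::unique_ptr<KVIterator>* iter) = 0;
};
```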

On top of the local store engine is the Consensus layer, which implements Multi Group Raft: each Partition corresponds to one Raft Group, where a Partition is a shard of data. Nebula currently shards data by static hashing; how the hash is computed is covered in the Schema section below. Users must specify the number of Partitions when creating a SPACE, and once set it cannot be changed, so the number of Partitions should be large enough to accommodate future business growth.

Above the Consensus layer, at the top of the Storage Service, sits the Storage interfaces layer, which defines a series of graph-related APIs. API requests are translated at this layer into kv operations on the corresponding Partitions. It is this layer that turns the storage service into a real graph store; without it, the Storage Service would just be a kv store. Nebula does not expose the kv layer as a separate service, mainly because graph queries involve a lot of computation that needs the graph schema, and the kv layer has no notion of a data schema; keeping them together makes computation push-down easier to implement.

Schema & Partition

The main data stored in a graph are vertices and edges. Nebula stores a property graph: in addition to vertices and edges, it also stores their corresponding properties, so that property filtering can be performed more efficiently.

For vertices, we use different Tags to represent different types of vertices. The same VertexID can be associated with multiple Tags, and each Tag has its own properties. In the kv store, we use VertexID + TagID as the key and encode the related properties into the value. The key format is shown in Figure 2, and a byte-layout sketch follows the field list below:

Figure 2: Vertex key format

Type: 1 byte, used to indicate the key type. The current types are data, index, system, etc.

Part ID: 3 bytes, used to identify the data shard (Partition). This field mainly makes it easy to scan an entire Partition's data by prefix during Partition redistribution (balance).

Vertex ID: 4 bytes, used to represent the ID of a vertex.

Tag ID: 4 bytes, used to represent the associated Tag.

Timestamp: 8 bytes, invisible to users, reserved for future distributed transactions (MVCC).
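The following is a byte-layout sketch of the vertex key built from the field list above. The field names and sizes come from the list; the function name and the exact byte order are illustrative assumptions, not Nebula's actual encoder.

```cpp
#include <cstdint>
#include <string>

// Pack the fields listed above into a 20-byte key: 1 + 3 + 4 + 4 + 8.
// Byte order here is illustrative; Nebula's real encoding may order bytes
// differently (for example, to keep prefix scans well ordered).
std::string vertexKey(uint8_t keyType, uint32_t partId, uint32_t vertexId,
                      uint32_t tagId, uint64_t timestamp) {
  std::string key;
  key.reserve(20);
  key.push_back(static_cast<char>(keyType));                    // Type: 1 byte
  key.append(reinterpret_cast<const char*>(&partId), 3);        // Part ID: low 3 bytes
  key.append(reinterpret_cast<const char*>(&vertexId), 4);      // Vertex ID: 4 bytes
  key.append(reinterpret_cast<const char*>(&tagId), 4);         // Tag ID: 4 bytes
  key.append(reinterpret_cast<const char*>(&timestamp), 8);     // Timestamp: 8 bytes
  return key;
}
```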

In a graph, each logical edge is modeled in Nebula Graph as two independent key-values, called the out-key and the in-key. The out-key is stored on the same Partition as the edge's source vertex, and the in-key is stored on the same Partition as the edge's destination vertex; in general, the out-key and in-key end up on two different Partitions.

There may be multiple types of edges between two vertices, and Nebula uses the Edge Type to distinguish them. There may also be multiple edges of the same type: for example, given an edge type "transfer", user A may transfer money to user B several times, so Nebula adds a Rank field to distinguish the multiple transfer edges between A and B. The Edge key format is shown in Figure 3:

Figure 3: Edge key format

Type: 1 byte, used to indicate the key type. The current types are data, index, system, etc.

Part ID: 3 bytes, used to identify the data shard (Partition). This field mainly makes it easy to scan an entire Partition's data by prefix during Partition redistribution (balance).

Vertex ID: 4 bytes, representing the source vertex ID for an out-edge and the destination vertex ID for an in-edge.

Edge Type: 4 bytes, used to indicate the edge type: a value greater than 0 indicates an out-edge, and a value less than 0 indicates an in-edge.

Rank: 4 bytes, used to handle multiple edges of the same type. Users can set it according to their own needs, for example to store a transaction time, a transaction serial number, or a sort weight.

Vertex ID: 4 bytes, representing the destination vertex ID for an out-edge and the source vertex ID for an in-edge.

Timestamp: 8 bytes, invisible to users, reserved for future distributed transactions.

If the value of the Edge Type is greater than 0, the corresponding edge key format is shown in Figure 4; if it is less than 0, the format is shown in Figure 5 (a byte-layout sketch follows).

Figure 4: Out-edge key format / Figure 5: In-edge key format
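Below is a sketch of the edge key layout, again with illustrative names and byte order rather than Nebula's exact encoding: the out-key and the in-key share one format, with the source and destination swapped and the sign of the Edge Type flipped.

```cpp
#include <cstdint>
#include <string>

// Shared edge-key layout: 1 + 3 + 4 + 4 + 4 + 4 + 8 bytes, matching the field
// list above. Byte packing is illustrative only.
std::string edgeKey(uint8_t keyType, uint32_t partId, uint32_t vertexId,
                    int32_t edgeType, uint32_t rank, uint32_t otherVertexId,
                    uint64_t timestamp) {
  std::string key;
  key.reserve(28);
  key.push_back(static_cast<char>(keyType));
  key.append(reinterpret_cast<const char*>(&partId), 3);
  key.append(reinterpret_cast<const char*>(&vertexId), 4);       // src for out-key, dst for in-key
  key.append(reinterpret_cast<const char*>(&edgeType), 4);       // > 0 out-edge, < 0 in-edge
  key.append(reinterpret_cast<const char*>(&rank), 4);
  key.append(reinterpret_cast<const char*>(&otherVertexId), 4);  // dst for out-key, src for in-key
  key.append(reinterpret_cast<const char*>(&timestamp), 8);
  return key;
}

// The out-key lives on the source vertex's partition, with a positive Edge Type.
std::string outEdgeKey(uint8_t t, uint32_t srcPart, uint32_t src, int32_t edgeType,
                       uint32_t rank, uint32_t dst, uint64_t ts) {
  return edgeKey(t, srcPart, src, edgeType, rank, dst, ts);
}

// The in-key lives on the destination vertex's partition, with the sign flipped
// and the vertex IDs swapped.
std::string inEdgeKey(uint8_t t, uint32_t dstPart, uint32_t src, int32_t edgeType,
                      uint32_t rank, uint32_t dst, uint64_t ts) {
  return edgeKey(t, dstPart, dst, -edgeType, rank, src, ts);
}
```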

The properties of a vertex or an edge are a set of kv pairs; Nebula encodes them and stores them in the corresponding value. Because Nebula uses a strongly typed schema, the specific schema information must first be fetched from the Meta Service before decoding. In addition, to support online schema changes, the corresponding schema version is included when the properties are encoded. The codec details are not expanded here; a dedicated article will cover them later.

OK, at this point we basically understand how Nebula stores data. So how is the data sharded? It is very simple: take the Vertex ID modulo the number of Partitions. Through this modulo operation, all the Tag information, out-edges and in-edges of a given vertex are placed on the same Partition, which greatly improves query efficiency. For online graph queries, the most common operation is a breadth-first (BFS) expansion from a vertex, so fetching a vertex's out-edges or in-edges is the most basic operation, and its performance determines the performance of the whole traversal. A BFS often also prunes by certain properties, and Nebula guarantees the efficiency of the whole operation by storing the properties together with the vertices and edges. At present, many graph databases demonstrate their efficiency on Graph 500 or Twitter datasets, which is not very representative, because those datasets have no properties, while most real-world scenarios are property graphs and a real BFS requires a lot of pruning.
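A minimal sketch of this static-hash sharding, with a hypothetical partitionOf helper and 1-based partition IDs assumed:

```cpp
#include <cstdint>
#include <iostream>

// Static-hash sharding: a vertex's partition is simply its ID modulo the number
// of partitions fixed when the SPACE is created. (1-based partition IDs assumed.)
uint32_t partitionOf(uint32_t vertexId, uint32_t numParts) {
  return vertexId % numParts + 1;
}

int main() {
  const uint32_t numParts = 100;  // chosen once, when the SPACE is created
  uint32_t v = 12345;
  // The vertex, all of its Tag records, its out-keys and the in-keys of edges
  // pointing at it all land on this same partition.
  std::cout << "vertex " << v << " -> partition " << partitionOf(v, numParts) << "\n";
  return 0;
}
```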

KVStore

Why do we build our own KVStore? This is a question we have been asked countless times. The reason is simple: the existing open-source KVStores could not meet our requirements:

Performance, performance: Nebula's requirement is straightforward: a high-performance pure kv store.

Provided as a library: for a strong-schema system like Nebula, computation push-down needs schema information, and implementing computation push-down is the key to Nebula's performance.

Strong data consistency: this is required of Nebula as a distributed system.

Written in C++: this is determined by the team's technology stack.

Based on the above requirements, Nebula implemented its own KVStore. Of course, for users who are not performance-sensitive and do not want to migrate data, Nebula also provides a plugin for the whole KVStore layer that builds the Storage Service directly on top of a third-party KVStore. Currently, the official plugin is HBase.

Nebula KVStore mainly uses RocksDB as its local storage engine. For machines with multiple disks, Nebula supports managing multiple disks to take full advantage of their combined concurrency; users only need to configure multiple data directories. The distributed KVStore is managed and scheduled by the Meta Service, which records the distribution of all Partitions and the current state of every machine. When users add or remove machines, they only need to issue the corresponding commands through the console, and the Meta Service generates and executes a balance plan. Fully automatic balancing is deliberately not used, mainly to reduce the impact of data migration on online services; the timing of balancing is left to the user.

To make the WAL easy to customize, Nebula KVStore implements its own WAL module, and each Partition has its own WAL, so that when a Partition catches up on data there is no need for a WAL split operation, which is more efficient. In addition, to support some special operations, a Command Log category is defined; these logs carry no real data and are only used, through Raft, to tell all replicas to perform a specific operation. Besides the Command Log, Nebula also provides a class of logs for atomic operations on a single Partition, such as CAS and read-modify-write, which makes full use of Raft's serialization property.

Support for multiple graph spaces (spaces): a single Nebula KVStore cluster can host multiple spaces, and each space can configure its own number of Partitions and replicas. Different spaces are physically isolated from each other, and different spaces in the same cluster can use different store engines and sharding strategies.

Raft

As a distributed system, the KVStore's replication, scale-out and related capabilities need the support of Raft. There are already many articles about Raft, so its principles are not repeated here; this section mainly covers some features and engineering details of Nebula's Raft implementation.

Multi Raft Group

Since a Raft log does not allow holes, writes within a single Raft Group are essentially serialized, so almost all implementations adopt Multi Raft Group to alleviate this problem; the number of Partitions therefore almost determines the performance of the whole Raft layer. But this does not mean that the more Partitions, the better: each Raft Group has to keep its own state, and each has its own WAL file, so too many Partitions increase overhead. Moreover, when there are too many Partitions and the load is not high enough, batching becomes meaningless. For example, an online system with 10,000 tps and more than 10,000 Partitions on a single machine sees roughly one tps per Partition, so batching loses its meaning and only adds CPU overhead. There are two key points in implementing Multi Raft Group. The first is to share the transport layer: each Raft Group sends messages to its corresponding peers, and if the transport layer cannot be shared, the connection cost is huge. The second is the threading model: Multi Raft Group must share one thread pool, otherwise the system ends up with too many threads and a lot of context-switch overhead.
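The following sketch, using hypothetical Transport and WorkerPool placeholders rather than Nebula's real classes, shows the shape of these two sharing points: every RaftPart (one per Partition) holds references to the same transport and the same worker pool, so neither connections nor threads grow with the number of Partitions.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Stand-ins for the two shared resources: one connection pool to each peer host
// and one fixed-size worker pool, reused by every Raft Group on the machine.
struct Transport {};
struct WorkerPool {};

class RaftPart {
 public:
  RaftPart(int32_t partId,
           std::shared_ptr<Transport> transport,
           std::shared_ptr<WorkerPool> workers)
      : partId_(partId),
        transport_(std::move(transport)),
        workers_(std::move(workers)) {}

 private:
  int32_t partId_;
  std::shared_ptr<Transport> transport_;  // shared across all groups, not per group
  std::shared_ptr<WorkerPool> workers_;   // shared across all groups, not per group
};

int main() {
  auto transport = std::make_shared<Transport>();
  auto workers = std::make_shared<WorkerPool>();
  std::vector<std::unique_ptr<RaftPart>> parts;
  for (int32_t p = 1; p <= 100; ++p) {
    // 100 Partitions means 100 Raft Groups, but still only one transport and
    // one worker pool in the whole process.
    parts.push_back(std::make_unique<RaftPart>(p, transport, workers));
  }
  return 0;
}
```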

Batch

For each Partition, WAL writes are serial, so batching is essential to improve throughput. Batching itself is nothing special, but Nebula exploits the serial nature of each Partition to implement some special types of WAL, which brings some engineering challenges.

For example, Nebula uses the WAL to implement lock-free CAS operations, and a CAS operation can only be executed after all preceding WAL entries have committed. So for a batch, if it contains several CAS-type WAL entries, the batch has to be split into several smaller groups that are guaranteed to execute serially. Likewise, a command-type WAL entry requires the entries after it to be executed only after it has committed, so the engineering of splitting a batch into groups is quite distinctive.
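A minimal sketch of this splitting logic, with hypothetical LogType/LogEntry types rather than Nebula's own: CAS and command entries are treated conservatively as barriers, and the batch is cut into sub-groups that must be committed strictly one after another.

```cpp
#include <string>
#include <utility>
#include <vector>

enum class LogType { NORMAL, CAS, COMMAND };

struct LogEntry {
  LogType type;
  std::string payload;
};

// Cut one batch into sub-groups that must be committed one after another:
// a CAS entry needs everything before it committed first, and a command entry
// needs everything after it to wait for its commit, so both act as boundaries.
std::vector<std::vector<LogEntry>> splitBatch(const std::vector<LogEntry>& batch) {
  std::vector<std::vector<LogEntry>> groups;
  std::vector<LogEntry> current;
  for (const auto& log : batch) {
    bool isBarrier = (log.type == LogType::CAS || log.type == LogType::COMMAND);
    if (isBarrier && !current.empty()) {
      groups.push_back(std::move(current));  // flush whatever precedes the barrier
      current.clear();
    }
    current.push_back(log);
    if (isBarrier) {
      groups.push_back(std::move(current));  // the barrier forms its own group
      current.clear();
    }
  }
  if (!current.empty()) {
    groups.push_back(std::move(current));
  }
  return groups;
}
```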

Learner

The Learner role mainly exists to handle scale-out: when a new machine is added, it needs to "catch up" on data for quite a long time, during which accidents may happen. If it joined directly as a follower while still catching up, it would reduce the HA capability of the whole cluster. Nebula implements the learner with the command WAL mentioned above: when the leader encounters an add-learner command while writing the WAL, it adds the learner to its peers and marks it as a learner. Learners are not counted when computing the majority, but logs are still sent to them as usual; of course, learners do not initiate elections.

Transfer Leadership

Transfer leadership is a very important operation for balance. When we move a Partition from one machine to another, we first check whether the source is the leader; if so, leadership must first be transferred to another peer. After the data migration is completed, a leader balance is usually performed as well, so that the load carried by each machine is balanced.

To implement transfer leadership, attention must be paid to when the leader gives up its leadership and when the follower starts a leader election. For the leader, it abandons leadership at the moment the transfer-leadership command is committed; for the target follower, it starts a leader election when it receives that command. This implementation should follow the same code path as Raft's own leader election, otherwise it is easy to run into corner cases that are hard to handle.

Membership change

To avoid split-brain, when the membership of a Raft Group changes, there must be an intermediate state in which the majority of the old group and the majority of the new group always overlap, preventing either the old or the new group from making unilateral decisions. This is the joint consensus described in the Raft paper. To simplify it further, Diego Ongaro proposed in his doctoral thesis to add or remove only one peer at a time, which guarantees that the majorities of the old and new groups always overlap. Nebula's implementation also takes this approach, although adding a member is implemented differently from removing one. The specific details are not covered in this article; interested readers can refer to the addPeer / removePeer implementation in the Raft Part class.

Snapshot

The Raft paper does not elaborate on how snapshots integrate with the Raft flow, but I think this is the most error-prone part of a Raft implementation, because many corner cases arise here.

For example, what if the leader changes while it is sending a snapshot? At that point a follower may have received only half of the snapshot data, so there must be a process to clean up that Partition's data; and since multiple Partitions share one storage instance, how to clean up the data is quite troublesome. In addition, the snapshot process generates a large amount of IO; for performance reasons we do not want it to share an IO thread pool with normal Raft traffic, and the whole process also needs a lot of memory, so optimizing memory usage is critical for performance. Due to space constraints, these issues are not discussed further in this article; interested readers can refer to the implementation of SnapshotManager.

Storage Service

On top of the KVStore interface, Nebula encapsulates a set of graph-semantic interfaces. The main ones are as follows:

GetNeighbors: queries the out-edges or in-edges of a batch of vertices, returns the edges and their corresponding properties, and supports conditional filtering

Insert vertex/edge: inserts a vertex or an edge and its properties

GetProps: gets the properties of a vertex or an edge

This layer translates the graph-semantic interface into kv operations. To improve traversal performance, the kv operations are executed concurrently, as sketched below.
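A sketch of this translation and concurrency, assuming the hypothetical partitionOf helper from the sharding section and a stub scanOutEdges standing in for the real prefix scan: the requested vertices are grouped by their owning Partition, and each Partition's scans run as an independent task.

```cpp
#include <cstdint>
#include <future>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Same static-hash rule as in the sharding sketch above.
uint32_t partitionOf(uint32_t vertexId, uint32_t numParts) {
  return vertexId % numParts + 1;
}

// Stub standing in for "prefix-scan the out-edge keys of vertexId in partId and
// decode the properties"; a real implementation would go through the KVStore.
std::vector<std::string> scanOutEdges(uint32_t partId, uint32_t vertexId) {
  return {"edge-of-" + std::to_string(vertexId) + "@part" + std::to_string(partId)};
}

// GetNeighbors over a batch of vertices: group the vertices by owning partition,
// run each partition's prefix scans as an independent task, then merge results.
std::vector<std::string> getNeighbors(const std::vector<uint32_t>& vertices,
                                      uint32_t numParts) {
  std::map<uint32_t, std::vector<uint32_t>> byPart;
  for (uint32_t v : vertices) {
    byPart[partitionOf(v, numParts)].push_back(v);
  }

  std::vector<std::future<std::vector<std::string>>> tasks;
  for (const auto& entry : byPart) {
    const uint32_t partId = entry.first;
    const std::vector<uint32_t> verts = entry.second;
    tasks.push_back(std::async(std::launch::async, [partId, verts]() {
      std::vector<std::string> edges;
      for (uint32_t v : verts) {
        auto scanned = scanOutEdges(partId, v);
        edges.insert(edges.end(), scanned.begin(), scanned.end());
      }
      return edges;
    }));
  }

  std::vector<std::string> result;
  for (auto& t : tasks) {
    auto part = t.get();
    result.insert(result.end(), part.begin(), part.end());
  }
  return result;
}

int main() {
  auto edges = getNeighbors({1, 2, 101, 202}, 100);
  std::cout << edges.size() << " edges fetched\n";
  return 0;
}
```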

Meta Service

On top of the KVStore interface, Nebula also encapsulates a set of meta-related interfaces. The Meta Service not only provides functions for adding, deleting, querying and modifying the graph schema, but also provides cluster management and user-authentication-related functions. The Meta Service can be deployed separately and can use multiple replicas to keep its data safe.

Summary

This article has given a general introduction to the overall design of the Nebula storage layer; due to space constraints, many details were not covered. You are welcome to join the NebulaGraph WeChat group to ask questions and discuss; to join, please contact the official NebulaGraph assistant WeChat account: NebulaGraphbot.
