How to realize the Index practice of distributed Graph Database Nebula Graph 07/02 Update SLTechnology News&Howtos

How to realize the Index practice of distributed Graph Database Nebula Graph

2025-07-02 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Database >

Shulou(Shulou.com)05/31 Report--

This article will explain in detail the Index practice of how to implement the distributed map database Nebula Graph. The content of the article is of high quality, so the editor shares it for you as a reference. I hope you will have some understanding of the relevant knowledge after reading this article.

Guide reading

Index is an indispensable function in the database system. Database index is like the catalogue of books, which can speed up the query speed of the database. Its essence is a sorted data structure in the database management system. Different database systems have different sorting structures. At present, common index implementation types such as B-Tree index, B+-Tree index, B*-Tree index, Hash index, Bitmap index, Inverted index and so on, all have their own sorting algorithms.

Although indexes can lead to higher query performance, there are some disadvantages, such as:

It takes extra time to create and maintain indexes, and maintenance costs often increase as the amount of data increases.

Indexes need to take up physical space

It takes more time to add, delete and modify the data, because the index also needs synchronous maintenance.

As a high-performance distributed graph database, Nebula Graph also implements the indexing function for high-performance query of attribute values. This article will give a detailed introduction to the indexing function of Nebula Graph.

Figure database Nebula Graph terminology

Before we begin, here are some graph databases and Nebula Graph proprietary terms that may be used:

Tag: the attribute structure of a point. Multiple tag can be attached to a Vertex, identified by TagID. (if you compare SQL, it can be understood as a table of dots.)

Edge: similar to Tag,EdgeType, it is an attribute structure on the edge, identified by EdgeType. (if you compare SQL, it can be understood as an edge table.)

An attribute value on Property:tag / edge whose data type is determined by the structure of tag / edge.

The smallest logical storage unit of a Partition:Nebula Graph, a StorageEngine can contain multiple Partition. Partition is divided into the roles of leader and follower. Raftex ensures the data consistency between leader and follower.

Graph space: each Graph Space is a separate business Graph unit, and each Graph Space has its own collection of tag and edge. A Nebula Graph cluster can contain multiple Graph Space.

Index: the Index that appears in this article refers to the attribute index in the middle and edge of the nebula graph. Its data type depends on tag / edge.

TagIndex: based on an index created by tag, multiple indexes can be created by one tag. Currently (2020.3) composite indexes across tag are not supported, so an index can only be based on one tag.

EdgeIndex: an index based on Edge. Similarly, an Edge can create multiple indexes, but an index can only be based on one edge.

Scan Policy:Index scanning strategy, often a query statement can have multiple index scanning methods, but the specific use of which scanning method needs to be decided by Scan Policy.

Optimizer: optimize query conditions, such as sorting, splitting, merging, etc., on the expression tree of the where clause. Its purpose is to obtain higher query efficiency.

Index requirement analysis

Nebula Graph is a graph database system, the query scenario usually starts from a point, finds out the set of related points of the specified edge type, and so on (breadth-first traversal) N-degree query. Another query scenario is to give an attribute value and find all the points or edges that match that attribute value. In the latter scenario, a high-performance scan of the attribute value is required to find the edge or point corresponding to the attribute value, as well as other attributes on the edge or point. In order to improve the query efficiency of attribute values, the function of index is introduced here. Sort the attribute values of an edge or point so that you can quickly locate to an attribute. In order to avoid full table scanning.

You can see the index requirements for the figure database Nebula Graph:

Attribute indexes that support tag and edge

Analysis and generation of scanning strategy supporting index

Support index management, such as: new index, re-index, delete index, list | show index, etc.

System architecture overview database Nebula Graph storage architecture

As you can see from the architecture diagram, each Storage Server can contain multiple Storage Engine, each Storage Engine can contain multiple Partition, and different Partition can be synchronized consistently through the Raft protocol. Each Partition contains both data and index, and the data and index of the same point or edge will be stored in the same Partition.

Business-specific analysis of data storage structure

In order to better describe the storage structure of the index, the storage structure of the original data of the graph database Nebula Graph is analyzed here.

Storage structure of points Data structure of points

Index structure of a point

The index structure of Vertex is shown in the table above. The following fields are described in detail:

PartitionId: the data and index of a point are logically stored in the same partition. There are two main reasons for doing so:

When scanning an index, a point data in the same partition can be quickly obtained according to the key of the index, so that any attribute value of this point can be easily obtained, even if the attribute column does not belong to this index.

At present, the storage of edge is determined by the ID Hash distribution of the starting point. In other words, where the outedge of a point is stored is determined by the VertexId of the point. If this point and its outedge are stored in the same partition, the index scan of the point can quickly locate the outedge of the point.

The identification code of the IndexId:index. The metadata information of the specified index can be obtained through indexId, for example, the information of the column of the TagId,index associated with the index.

The core storage structure of Index binary:index, which is the byte encoding of all index related column attribute values, is explained in detail in the # Index binary# chapter of this article.

VertexId: the identification code of the point. In the actual data, a point may have multiple lines of data of different version. But in index, index has no concept of Version, and index always corresponds to the Tag of the latest Version.

After talking about the fields above, let's simply practice and analyze a wave:

Suppose that PartitionId is 100 Magi TagId, there are tag_1 and tag_2, where tag_1 contains three columns: colt1_1, col_t1_2, col_t1_3,_tag_2 contains two columns: col_t2_1, col_t2_2.

Now let's create the index:

I1 = tag_1 (col_t1_1, col_t1_2), assuming that the ID of i1 is 1

I2 = tag_2 (col_t2_1, col_t2_2), assuming that the ID of i2 is 2

You can see that although there is a col_t1_3 column in tag_1, col_t1_3 is not used to build the index, because the index in the graph database Nebula Graph can be created based on one or more columns of Tag.

Insertion point / / VertexId = hash ("v_t1_1"), if 50 INSERT VERTEX tag_1 (col_t1_1, col_t1_2, col_t1_3), tag_2 (col_t2_1, col_t2_2)\ VALUES hash ("v_t1_1"): ("v_t1_1", "v_t1_2", "v_t1_3", "v_t2_1", "v_t2_2")

From the above, you can see that the value corresponding to the ID ID can be obtained by Hash. If the corresponding value of the ID is already int64, there is no need for Hash or other operations to convert the value to int64. At this point, the data is stored as follows:

The Data structure at this time

The Index structure at this time

Description: row and key in index is a concept, which is the unique identification of the index.

Edge storage structure

The principle of edge index structure is similar to that of point index structure, so I won't repeat it here. However, it should be noted that in order to establish the uniqueness of the index key, the key of the index is generated with the help of many elements in the data, such as VertexId, SrcVertexId, Rank, etc., which is why there is no TagId field in the point index (there is no EdgeType field in the edge index), this is because the IndexId itself with information such as VertexId can directly distinguish between the specific tagId or EdgeType.

Data structure of edges

Index structure of edges

Index binary introduction

Index binary is the core field of index, which can be divided into fixed length field and variable length field in index binary, int, double and bool are fixed length fields, and string is indefinite length fields. Because index binary encodes and stores the attribute values of all index column, in order to accurately locate the variable length field, Nebula Graph records the length of the variable length field with int32 at the end of the index binary.

For example:

We now have an index binary of index1, which consists of index column 1 C1 of type int and index column c2 of type string and index column c3 of type c2.

Index1 (c1:int, c2:string, c3:string)

If the property values for a row of index columns C1, c2, and c3 are 23, "abc", and "here", respectively, these index columns will be stored as follows in index1 (in order to make it easier to understand, we directly use the original values, which are encoded and then stored in the actual storage):

Length = sizeof ("abc") = 3

Length = sizeof ("here") = 4

So the key corresponding to the index1 row is 23abchere34.

Going back to the index binary format that we said at the beginning of the Index binary chapter, there is a Variable-length field lenght field, so what is the specific purpose of this field? Let's give a simple example:

Now we have another index binary, which we call index2, which consists of index column 1 C1 of type string and index column c2 of type string and index column c3 of type string:

Index2 (c1:string, c2:string, c3:string)

Suppose we now have two sets of values for C1, c2, and c3 respectively:

Row1: ("ab", "ab", "ab")

Row2: ("aba", "ba", "b")

You can see that the prefix of the two lines (the red part above) is the same, both are "ababab". At this time, how to distinguish the key of the index binary of the two row? Don't worry, we have Variable-length field lenght.

If we encounter conditional query statements such as where C1 = = "ab", we can read out the length of C1 directly according to the order in Variable-length field length, and then take out the value of C1 in row1 and row2 according to this length, which are "ab" and "aba" respectively, so we can accurately judge that only the "ab" in row1 meets the query conditions.

The processing logic of index Index write

When an index is created by one or more columns in Tag / Edge, the corresponding index must be modified along with the data when writing operations related to Tag / Edge are involved. The following is a brief introduction to the processing logic of the write operation of the index at the storage layer:

INSERT-- insert data

When the user generates an insertion point / edge operation, insertProcessor first determines whether the inserted data has an indexed Tag attribute / Edge attribute. If there is no associated attribute column index, the new Version is generated as usual, and the data is put to Storage Engine;. If there is an associated attribute column index, it is written to Data and Index through the atomic operation, and it is determined whether the current Vertex / Edge has the old attribute value, and if so, the old attribute value is also deleted in the atomic operation.

DELETE-- delete data

When a user has a Drop Vertex / Edge operation, deleteProcessor deletes both Data and Index (if any), and atomic operations are also required in the deletion process.

UPDATE-- updates data

For Index, the update operation of Vertex / Edge is the operation of drop and insert: delete the old index and insert the new index. In order to ensure the consistency of the data, it also needs to be carried out in the atomic operation. However, for a normal Data, it is only an insert operation, and you can overwrite the data of the old Version with the Data of the latest Version.

Index scan

In the graph database Nebula Graph, the LOOKUP statement is used to deal with the index scan operation. The LOOKUP statement can use the attribute value as the judgment condition to find out all the points / edges that meet the conditions. Similarly, the LOOKUP statement supports WHERE and YIELD clauses.

Skills of using LOOKUP

As described in section # data storage structure # of this article, the index columns in index are determined in the order in which they were created when the index was created.

For example, we now have tag (col1, col2), according to which we can create different indexes, such as:

Index1 on tag (col1)

Index2 on tag (col2)

Index3 on tag (col1, col2)

Index4 on tag (col2, col1)

We can build multiple indexes on clo1 and col2, but in scan index, the results returned by the above four index are different, even completely different. Which index is used in the actual business, and the optimal execution strategy of index is determined by the index optimizer.

Let's take an in-depth analysis of the four examples of index just now:

Lookup on tag where tag.col1 = = 1 # the best index is index1lookup on tag where tag.col2 = = 2 # the best index is index2lookup on tag where tag.col1 > 1 and tag.col2 = = 1 # index3 and index4 are both valid index, while index1 and index2 are invalid

In the third example above, both index3 and index4 are valid index, but eventually one of them must be selected as index. According to the optimization rule, because tag.col2 = = 1 is an equivalent query, it is more efficient to give priority to tag.col2, so the optimizer should choose index4 as the optimal index.

Fuck the Nebula Graph index of the graph database

In this part, we will not explain specifically the purpose of a sentence. If you are not clear about the sentence, you can go to the official forum of the database Nebula Graph to ask a question: https://discuss.nebula-graph.io/.

Creation of CREATE-- index (user@127.0.0.1:6999) [(none)] > CREATE SPACE my_space (partition_num=3, replica_factor=1); Execution succeeded (Time spent: 15.566 Time spent 16.602 ms) Thu Feb 20 12:46:38 2020 (user@127.0.0.1:6999) [(none)] > USE my_space Execution succeeded (Time spent: 7.681 ms) Thu Feb 20 12:46:51 2020 (user@127.0.0.1:6999) [my_space] > CREATE TAG lookup_tag_1 (col1 string, col2 string, col3 string); Execution succeeded (Time spent: 12.228 ms) Thu Feb 20 12:47:05 2020 (user@127.0.0.1:6999) [my_space] > CREATE TAG INDEX t_index_1 ON lookup_tag_1 (col1, col2, col3) Execution succeeded (Time spent: 1.639 Time spent 2.271 ms) Thu Feb 20 12:47:22 2020DROP-delete index (user@127.0.0.1:6999) [my_space] > DROP TAG INDEX tweets indexation 1 auditory execution succeeded (Time spent: 4.147ash 5.192 ms) Sat Feb 22 11:30:35 20REBUILDTV-rebuild the index

If you are upgrading from an older version of Nebula Graph, or if the index is not open during bulk writes with Spark Writer (for performance), then the data has not yet been indexed, and you can use the REBUILD INDEX command to re-index it all. This process can take a long time, and the client reads and writes slower before the rebuild index is completed.

REBUILD {TAG | EDGE} INDEX [OFFLINE] LOOKUP-- use index

To be clear, make sure you have an index (CREATE INDEX or REBUILD INDEX) before using the LOOKUP statement.

(user@127.0.0.1:6999) [my_space] > INSERT VERTEX lookup_tag_1 (col1, col2, col3) VALUES 200: ("col1_200", "col2_200", "col3_200"), ("col1_201", "col2_201", "col3_201"), 202: ("col1_202", "col2_202", "col3_202") Execution succeeded (Time spent: 18.185 ms) Thu Feb 20 12:49:44 2020 (user@127.0.0.1:6999) [my_space] > LOOKUP ON lookup_tag_1 WHERE lookup_tag_1.col1 = = "col1_200" = | VertexID | = | 200 |-Got 1 rows (Time spent: 12.001 ms 12.64 ms) Thu Feb 20 12:49:54 2020 (user@127.0.0.1:6999) [my_space] > LOOKUP ON lookup_tag_1 WHERE lookup_tag_1.col1 = = "col1_200" YIELD lookup_tag_1.col1, lookup_tag_1.col2, lookup_tag_1.col3 = = | VertexID | lookup_tag_1.col1 | lookup_tag_1.col2 | lookup_tag_1.col3 | = | 200 | col1_200 | col2_200 | col3_200 |- -Got 1 rows (Time spent: 3.679 ms 4.657 ms) Thu Feb 20 12:50:36 2020 Index practice on how to implement Nebula Graph, a distributed map database, ends here I hope the above content can be of some help to you and learn more knowledge. If you think the article is good, you can share it for more people to see.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.