This article explains in detail the practice and customized development of Nebula Graph at large data scale. It is shared here for reference; I hope you gain an understanding of the relevant topics after reading it.
Graph data has promising prospects in social recommendation, multi-hop real-time computing, risk control, and security. Storing and querying large-scale heterogeneous graph data efficiently with a graph database is a major challenge. This article describes the problems we encountered in practice with the open-source distributed graph database Nebula Graph, and how we achieved large-data-set storage, hour-level full import, multi-version control, second-level rollback, and millisecond-level access.
Background
Most well-known graph databases struggle with large data sets. For example, the community edition of Neo4j, which uses the Cypher language and runs as a single replica, is widely used in the knowledge graph field; Internet companies can only use it on small data sets, and they must also solve multi-replica consistency and disaster recovery on their own. JanusGraph solves large-data-set storage through external metadata management, KV storage, and indexing, but its performance problems are widely criticized: most graph database performance comparisons claim improvements of dozens of times or more over JanusGraph.
Internet companies facing large data volumes generally take the self-development route and support only limited query semantics to meet business needs. Here is how mainstream domestic Internet companies address the graph database challenge:
Ant Financial Services Group:
A financial-grade graph database that serves the business side through a custom query language, with full computation push-down and millisecond-level latency. It is mainly applied to the following scenarios:
Financial risk control: a trillion-edge funds network that stores real-time transaction information for real-time fraud detection.
Recommendation: stock and securities recommendation.
Ant Forest: trillion-scale graph storage with low-latency, strongly consistent relational data queries and updates.
GNN: used for hour-level GNN training, with online inference over dynamic-graph GNNs being explored.
Alibaba:
IGraph is a graph index and query system that stores user behavior information and is one of the "four carriages" of Alibaba's data infrastructure. It provides real-time e-commerce graph queries to the business side through the Gremlin language.
ByteDance:
ByteGraph adds a unified cache layer on top of KV storage and splits relational data into B+ trees to support efficient edge access and sampling, similar to Facebook's TAO [6].
...
Architecture diagram
Where does the practice begin?
We chose to start our graph database journey with Nebula Graph [4]. What attracted us were the following:
The data set is sharded and each edge is stored independently, giving it the storage potential for very large data sets.
Its custom, strongly consistent storage engine has the potential for computation push-down and MPP optimization.
The founding team has rich graph database experience, and its modeling abstractions for large data sets have been validated.
The problem of memory explosion in practice
In essence, this is a performance-versus-resources problem: in applications with large data volumes, memory usage cannot be ignored. RocksDB memory consists of three parts: block cache, indexes and bloom filters, and blocks pinned by iterators.
Block cache optimization: a global LRU cache is used to control the block cache usage of all RocksDB instances on the machine.
Bloom filter optimization: each edge is stored as one key-value pair in RocksDB. If a bloom filter were kept for every key, at 10 bits per key, the filters alone would consume far more memory than the machine has. We observed that most of our requests fetch the edge list of a vertex, so we use a prefix bloom filter indexed down to the vertex prefix, which still speeds up most requests. After this optimization, per-machine filter memory drops to the gigabyte level, and the access speed of most requests is not noticeably reduced. A RocksDB configuration sketch covering both optimizations follows.
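The following is a minimal RocksDB options sketch under our own assumptions: the 8-byte vertex-id prefix and the 8 GB cache size are placeholders, not the values used in the article. It shows a block cache shared across instances together with a prefix bloom filter.

#include <memory>
#include <rocksdb/cache.h>
#include <rocksdb/db.h>
#include <rocksdb/filter_policy.h>
#include <rocksdb/options.h>
#include <rocksdb/slice_transform.h>
#include <rocksdb/table.h>

// Build Options for one RocksDB instance; every instance on the machine
// receives the same shared_cache so total block-cache memory stays bounded.
rocksdb::Options MakeOptions(const std::shared_ptr<rocksdb::Cache>& shared_cache) {
  rocksdb::BlockBasedTableOptions table_opts;
  table_opts.block_cache = shared_cache;           // shared LRU block cache
  table_opts.filter_policy.reset(
      rocksdb::NewBloomFilterPolicy(10, false));   // ~10 bits per filter entry
  table_opts.whole_key_filtering = false;          // filter on prefixes only

  rocksdb::Options opts;
  opts.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_opts));
  // Assumed key layout: the first 8 bytes hold the vertex id, so the bloom
  // filter is built per vertex prefix rather than per full edge key.
  opts.prefix_extractor.reset(rocksdb::NewFixedPrefixTransform(8));
  return opts;
}

// One cache shared by all instances, e.g. 8 GB in total (placeholder size):
// auto shared_cache = rocksdb::NewLRUCache(8ULL << 30);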
Multi-version control
In practice, graph data needs to be rolled back quickly, imported periodically, and requests should automatically access the latest version of the data. We can roughly divide data sources into two types:
Periodic data: for example, a list of similar users is calculated on a daily basis, and the data takes effect after import.
Historical data + real-time data: for example, historical data is refreshed on a daily basis and merged with real-time written data into full data.
The following is the storage model of the data in rocksdb:
Vertex storage format
Edge storage format
For real-time writes, the version field is recorded as a timestamp; the version of data imported offline must be specified explicitly. We use this field together with the offline import module for version control, via three configuration items: reserve_versions (the list of versions to retain), active_version (the version served to user requests), and max_version (retain data newer than a given version and merge that historical data with real-time writes). This allows both offline and online data to be managed efficiently; data that is no longer needed is erased from disk during the next compaction.
In this way, the data version can be switched transparently to business code and rolled back in seconds.
For example:
Keep 3 versions and activate one of them:
Alter edge friend reserve_versions = 1 2 3 active_version = 1
When the data source is historical data plus real-time imported data:
Alter edge friend max_version = 1592147484
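To make the compaction-time cleanup concrete, below is a minimal sketch of a RocksDB compaction filter that keeps only the versions listed in reserve_versions plus anything newer than max_version. The key layout and version extraction are our assumptions, not Nebula Graph's actual implementation.

#include <cstdint>
#include <cstring>
#include <set>
#include <string>
#include <rocksdb/compaction_filter.h>
#include <rocksdb/slice.h>

// Drops records whose version is neither explicitly retained nor newer than
// max_version. Assumption: the version sits in the last 8 bytes of the key.
class VersionCompactionFilter : public rocksdb::CompactionFilter {
 public:
  VersionCompactionFilter(std::set<int64_t> reserve_versions, int64_t max_version)
      : reserve_versions_(std::move(reserve_versions)), max_version_(max_version) {}

  bool Filter(int /*level*/, const rocksdb::Slice& key,
              const rocksdb::Slice& /*value*/, std::string* /*new_value*/,
              bool* /*value_changed*/) const override {
    const int64_t version = ExtractVersion(key);
    const bool keep = reserve_versions_.count(version) > 0   // explicitly retained
                      || version >= max_version_;            // merged real-time data
    return !keep;  // returning true removes the record during compaction
  }

  const char* Name() const override { return "VersionCompactionFilter"; }

 private:
  static int64_t ExtractVersion(const rocksdb::Slice& key) {
    int64_t v = 0;
    std::memcpy(&v, key.data() + key.size() - sizeof(v), sizeof(v));
    return v;
  }

  std::set<int64_t> reserve_versions_;
  int64_t max_version_;
};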
Fast batch import
In practice, importing large amounts of data is a routine operation. If the data to be imported were simply converted into requests and sent to the graph database without any optimization, it would not only seriously affect online requests but also take more than a day to complete, so optimizing import speed was urgent. The industry generally uses SST ingest to solve this problem [5]. We take a similar approach: a regularly scheduled Spark job generates the data files offline, each storage node then pulls the files it needs and ingests them into the database, and finally a version switch directs requests to the newly imported data.
The whole import completes in a few hours. Since the computation is done mostly offline, it has little impact on graph database requests.
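As a rough illustration of the ingest step, here is a minimal sketch using RocksDB's IngestExternalFile; the file paths are placeholders, and the actual Nebula Graph download-and-ingest flow is more involved.

#include <cassert>
#include <string>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Ingest step on a storage node, assuming the SST files generated offline by
// the Spark job have already been downloaded to local paths (placeholders).
void IngestOfflineFiles(rocksdb::DB* db) {
  rocksdb::IngestExternalFileOptions ingest_opts;
  ingest_opts.move_files = true;  // link/move the files instead of copying
  const std::vector<std::string> files = {
      "/data/download/part-00000.sst",
      "/data/download/part-00001.sst",
  };
  rocksdb::Status s = db->IngestExternalFile(files, ingest_opts);
  assert(s.ok());
  // After ingestion, switch active_version so that requests start reading
  // the newly imported data.
}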