Many people with little experience are unsure how to think about graph databases, so this article summarizes the problem and how to approach it; I hope it helps you work through it.
The following mainly discusses the design ideas and principles behind graph databases, some scenarios they suit, and concrete cases of using graph databases in production environments.
Starting from the Social Network
The picture below shows a social networking scenario in which each user can post, share, or comment on Weibo posts. These are the most basic create, read, update, and delete operations, and they are also the most common things developers do with a database. In daily work, besides entering a user's basic information into the database, developers also need to find the information related to that user to support further analysis of the individual. For example, if we find that Zhang San's account contains a lot of content about AI and music, we can infer that he may be a programmer and push content he is likely to be interested in.
This kind of data analysis happens all the time, but a seemingly simple data workflow can become quite complex to implement, and database performance can drop sharply as the amount of data grows. For example, retrieving all employees within three reporting levels under a manager is a common statistical query in today's data analysis, and the choice of database can make an enormous difference in how such an operation performs.
The traditional database solution
Conceptual model and query code of a traditional database
Traditionally, the most straightforward way to solve the above problem is to build a relational model: enter each employee's information into a table and store it in a relational database such as MySQL. The following figure shows the most basic relational model:
However, to meet our requirement on top of this relational model, the query inevitably involves a large number of JOIN operations, and the SQL statements become very long (sometimes up to hundreds of lines):
(SELECT T.directReportees AS directReportees, sum(T.count) AS count
 FROM (
   SELECT manager.pid AS directReportees, 0 AS count
   FROM person_reportee manager
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   UNION
   SELECT manager.pid AS directReportees, count(manager.directly_manages) AS count
   FROM person_reportee manager
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
   UNION
   SELECT manager.pid AS directReportees, count(reportee.directly_manages) AS count
   FROM person_reportee manager
   JOIN person_reportee reportee ON manager.directly_manages = reportee.pid
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
   UNION
   SELECT manager.pid AS directReportees, count(L2Reportees.directly_manages) AS count
   FROM person_reportee manager
   JOIN person_reportee L1Reportees ON manager.directly_manages = L1Reportees.pid
   JOIN person_reportee L2Reportees ON L1Reportees.directly_manages = L2Reportees.pid
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
 ) AS T
 GROUP BY directReportees)
UNION
(SELECT T.directReportees AS directReportees, sum(T.count) AS count
 FROM (
   SELECT manager.directly_manages AS directReportees, 0 AS count
   FROM person_reportee manager
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   UNION
   SELECT reportee.pid AS directReportees, count(reportee.directly_manages) AS count
   FROM person_reportee manager
   JOIN person_reportee reportee ON manager.directly_manages = reportee.pid
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
   UNION
   SELECT depth2Reportees.pid AS directReportees, count(depth3Reportees.directly_manages) AS count
   FROM person_reportee manager
   JOIN person_reportee L1Reportees ON manager.directly_manages = L1Reportees.pid
   JOIN person_reportee L2Reportees ON L1Reportees.directly_manages = L2Reportees.pid
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
 ) AS T
 GROUP BY directReportees)
UNION
(SELECT T.directReportees AS directReportees, sum(T.count) AS count
 FROM (
   SELECT reportee.directly_manages AS directReportees, 0 AS count
   FROM person_reportee manager
   JOIN person_reportee reportee ON manager.directly_manages = reportee.pid
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
   UNION
   SELECT L2Reportees.pid AS directReportees, count(L2Reportees.directly_manages) AS count
   FROM person_reportee manager
   JOIN person_reportee L1Reportees ON manager.directly_manages = L1Reportees.pid
   JOIN person_reportee L2Reportees ON L1Reportees.directly_manages = L2Reportees.pid
   WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName")
   GROUP BY directReportees
 ) AS T
 GROUP BY directReportees)
UNION
(SELECT L2Reportees.directly_manages AS directReportees, 0 AS count
 FROM person_reportee manager
 JOIN person_reportee L1Reportees ON manager.directly_manages = L1Reportees.pid
 JOIN person_reportee L2Reportees ON L1Reportees.directly_manages = L2Reportees.pid
 WHERE manager.pid = (SELECT id FROM person WHERE name = "fName lName"))
This kind of glue code is a disaster for maintainers and developers, and no one wants to write or debug it. In addition, this kind of code is often accompanied by serious performance problems, which will be discussed in detail later.
Performance problems of traditional relational databases
The essence of the performance problem lies in the amount of data the analysis has to face. If only a few dozen nodes are queried, there is no need to think about database performance optimization at all. However, once the number of nodes grows from hundreds to millions or even tens of millions, database performance becomes one of the most important factors in the whole product design.
As the number of nodes grows, the relationships between users, between users and products, or between products themselves grow exponentially.
Here are some public figures that reflect the scale of real-world data and of the relationships between data:
Twitter: 500 million users, with follow and like relationships between users.
Amazon: 120 million users, with purchase relationships between users and products.
AT&T (one of the three major carriers in the United States): 100 million phone numbers, with call relationships between numbers.
As shown in the following table, open source graph data sets often have data of tens of millions of nodes and hundreds of millions of edges:
In a scenario with such a large amount of data, using traditional SQL will cause great performance problems for two main reasons:
The overhead of a large number of JOIN operations: the query above relies on many JOIN operations to find the desired results. When the amount of data is very large, these JOINs incur a huge performance cost, because the data is stored in fixed locations and the query only needs part of it, yet the JOIN operations themselves traverse the entire database, which leads to unacceptable query efficiency.
The overhead of reverse queries: querying a single manager's subordinates is not expensive, but if we want to query an employee's boss in the reverse direction using the same table structure, the cost becomes very high. An unreasonable table design also hurts the performance of downstream analysis and recommendation systems. For example, when the relationship changes from boss -> employee to user -> product, a system that does not support reverse queries loses much of its real-time recommendation capability, which translates into economic losses.
The following table lists an unofficial performance test (social networking test suite of 1 million users, each with about 50 friends), reflecting the performance changes in relational databases as the depth of friend queries increases:
Conventional optimization strategies of traditional databases
Strategy 1: indexing
Index: the SQL engine uses the index to find the corresponding data.
Common indexes include B-tree indexes and hash indexes, and indexing a table is a fairly routine way to optimize SQL performance. Put simply, a B-tree index gives every record a sortable, independent ID; the B-tree itself is a balanced multi-way search tree that keeps elements sorted by this index ID, which supports range lookups with complexity O(log N), where N is the number of indexed records.
But indexes cannot solve every problem. If records are updated frequently, or there are many duplicate elements, an index wastes a great deal of space. The IO cost of an index is also worth considering: index IO, especially random reads and writes on a mechanical hard disk, performs very poorly. A conventional B-tree index lookup consumes about four random IO reads, and as JOIN operations pile up, the number of disk seeks can easily reach hundreds.
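As a reference, the sorted-order range-lookup idea behind a B-tree index can be sketched in a few lines of Python. This is only an illustration of the O(log N) lookup it enables (a flat sorted list rather than a real multi-way balanced tree), and the class and variable names are made up for this sketch:

# Minimal sketch of the idea behind a B-tree index: keep IDs in sorted order
# so that point and range lookups cost O(log N) instead of a full table scan.
# This is a flat sorted list, not a real balanced multi-way tree.
import bisect

class SortedIdIndex:
    def __init__(self, ids):
        self.ids = sorted(ids)                  # sort once when the index is built

    def contains(self, key):
        i = bisect.bisect_left(self.ids, key)   # O(log N) binary search
        return i < len(self.ids) and self.ids[i] == key

    def range(self, low, high):
        # All IDs in [low, high]: two binary searches plus a scan of the matches only.
        lo = bisect.bisect_left(self.ids, low)
        hi = bisect.bisect_right(self.ids, high)
        return self.ids[lo:hi]

index = SortedIdIndex([2020031604, 2020031601, 2020031603, 2020031602])
print(index.contains(2020031602))           # True
print(index.range(2020031601, 2020031603))  # [2020031601, 2020031602, 2020031603]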
Strategy 2: caching
Caching: caching mainly addresses the performance problems caused by frequent reads of data with spatial or temporal locality. A common caching architecture is the look-aside cache. The figure below shows an earlier example of Facebook using Memcached + MySQL (since replaced by TAO, a graph database developed by Facebook):
In this architecture, the designers assume that users read far more content than they create. Memcached can be understood simply as a distributed hash table that supports create, read, update, and delete operations and serves hundreds of millions of user requests. The basic flow is: when a client needs to read data, it checks the cache before querying the SQL database; when a client needs to write data, it first deletes the key from the cache to expire the data, and then updates the database. But this architecture has several problems:
First, a key-value cache is not a good fit for graph-structured data. If each query touches one edge, all the edges of the corresponding node have to be fetched from the cache; and when one edge is updated, all of the node's cached edges are invalidated and then reloaded. These become concurrency and performance bottlenecks, since in real scenarios a single vertex is often accompanied by thousands of edges, and the time and memory consumed by such operations cannot be ignored.
Second, there is a delay between a data update and its readability, and in the above architecture this requires cross-region communication between the master and the read-only replica databases. The original design uses an external marker to record expired key-value pairs and asynchronously forwards such read requests from the read-only replica to the master, and this cross-region communication has far higher latency than reading locally (similar to travelling from Beijing to Shenzhen instead of a few hundred meters).
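Before moving on, here is a minimal sketch of the look-aside read path and the delete-on-write invalidation described above. The dict-based cache and db objects are stand-ins for illustration only, not Facebook's actual Memcached or MySQL client code:

# Minimal look-aside cache sketch: a read checks the cache first and fills it on
# a miss; a write deletes (invalidates) the cached key first, then updates the
# database. The dicts below are stand-ins for Memcached and the SQL database.
cache = {}   # stand-in for Memcached
db = {}      # stand-in for the SQL database

def read(key):
    if key in cache:              # cache hit: no database query needed
        return cache[key]
    value = db.get(key)           # cache miss: fall back to the database
    if value is not None:
        cache[key] = value        # fill the cache for later readers
    return value

def write(key, value):
    cache.pop(key, None)          # invalidate the stale cache entry first
    db[key] = value               # then update the database

write("user:42:name", "Zhang San")
print(read("user:42:name"))       # miss -> loaded from db -> "Zhang San"
print(read("user:42:name"))       # hit  -> served from the cache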
Modeling using graph structure
The main reason the relational modeling above falls short is that the model has no inherent notion of the relationships between data. For this kind of problem, a better way to model is to use a graph structure.
If the data itself fits a tabular structure, a relational database solves the problem well; but if what you want to express is the relationships between pieces of data, a relational database struggles. The root cause is the large number of JOIN operations that queries inevitably require, while each JOIN uses only part of the data, so repeated JOINs waste a great deal of performance. How can modeling solve this better? The answer lies in the relationships between nodes.
Nodes, relationships, and the graph data model
In the discussion above, the traditional database implicitly expresses relationships between data by JOINing different tables together; but when we want to query a multi-hop chain of management relationships (for example, whom the people managed by A manage in turn), the table structure cannot tell us the result directly.
To be able to answer such queries, we must first define nodes and relationships.
Defining nodes and relationships first is the core difference between graph databases and other databases. For example, we can represent managers and employees as different nodes and use an edge to represent the management relationship between them, or treat users and goods as nodes and model the purchase relationship between them, and so on. When new kinds of nodes and relationships are needed, only a few updates are required; there is no need to change the table structure or migrate data.
Based on the nodes and associations, the previous data can be modeled as shown in the following figure:
To model this with nGQL, the native graph query language of the graph database Nebula Graph, refer to the following operations:
-- Insert people
INSERT VERTEX person(ID, name) VALUES 1:(2020031601, 'Jeff');
INSERT VERTEX person(ID, name) VALUES 2:(2020031602, 'A');
INSERT VERTEX person(ID, name) VALUES 3:(2020031603, 'B');
INSERT VERTEX person(ID, name) VALUES 4:(2020031604, 'C');
-- Insert edges
INSERT EDGE manage(level_s, level_end) VALUES 1 -> 2: ('0', '1');
INSERT EDGE manage(level_s, level_end) VALUES 1 -> 3: ('0', '1');
INSERT EDGE manage(level_s, level_end) VALUES 1 -> 4: ('0', '1');
The previously super-long query statement can likewise be reduced to 3 or 4 short lines of Cypher / nGQL.
The following is the nGQL statement:
GO FROM 1 OVER manage YIELD manage.level_s AS start_level, manage._dst AS personid
| GO FROM $personid OVER manage WHERE manage.level_s < start_level + 3
YIELD SUM($$.person.id) AS TOTAL, $$.person.name AS list
The following is the Cypher version:
MATCH (boss)-[:MANAGES*0..3]->(sub), (sub)-[:MANAGES*1..3]->(personid)
WHERE boss.name = "Jeff"
RETURN sub.name AS list, count(personid) AS Total
Going from nearly 100 lines of code to 3 or 4, the advantage of a graph database in expressive power is clear.
Graph database performance optimization
Graph databases are specifically optimized for highly connected, weakly structured data, and different graph databases are further optimized for different scenarios. The author briefly introduces the following graph databases, all of which, by the way, support native graph modeling.
Neo4j
Neo4j is the best-known graph database; in industry, Microsoft and eBay use Neo4j to solve some of their business scenarios. Neo4j's performance optimization has two aspects: native graph data processing, and an LRU-K cache for caching data.
Optimization of native graph data processing
When we say a graph database supports native graph processing, we mean that the database supports index-free adjacency.
Index-free adjacency means that each node keeps direct references to the nodes it is connected to, so the node itself serves as an index of its neighbours. This performs much better than using a global index: a graph traversal query is independent of the size of the whole graph and depends only on the number of edges attached to the queried nodes. Where a query using a B-tree index has complexity O(log N), a lookup over this structure has complexity O(1). When we query several levels deep, the query time does not grow exponentially with the size of the data set but stays a relatively stable constant, because each step only follows the edges of the relevant nodes and never traverses all nodes.
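The difference can be sketched in a few lines of Python. This is only an illustration of the adjacency idea, not Neo4j's storage format; each node keeps direct references to its neighbours, so a multi-hop traversal touches only the edges it actually visits, regardless of total graph size:

# Index-free adjacency sketch: each vertex keeps direct references to its
# neighbours, so expanding a vertex costs O(number of its edges) and a
# multi-hop traversal never consults a global index or scans the whole graph.
from collections import defaultdict

adjacency = defaultdict(list)     # vertex -> list of directly managed vertices

def add_edge(src, dst):
    adjacency[src].append(dst)    # the source vertex itself "is the index" of its edges

def reportees_within(manager, depth):
    # Collect everyone reachable from `manager` in at most `depth` hops.
    frontier, seen = [manager], set()
    for _ in range(depth):
        frontier = [nxt for v in frontier for nxt in adjacency[v] if nxt not in seen]
        seen.update(frontier)
    return seen

add_edge("Jeff", "A"); add_edge("Jeff", "B"); add_edge("Jeff", "C")
add_edge("A", "D");    add_edge("D", "E")
print(reportees_within("Jeff", 3))   # {'A', 'B', 'C', 'D', 'E'}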
Main memory cache optimization
Neo4j 2.2 uses an LRU-K page cache. In short, it evicts the least-used pages from the cache (the page whose K-th most recent access lies furthest in the past goes first) and keeps frequently used pages in memory, a design that ensures statistically optimal use of cache resources.
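A highly simplified sketch of the LRU-K idea follows. It is illustrative only (with K = 2 and made-up names), not Neo4j's actual page cache: every page remembers its last K access times, and the page whose K-th most recent access is oldest is evicted first, with pages that have fewer than K recorded accesses treated as oldest.

# Simplified LRU-K sketch: remember the last K access ticks of every page and
# evict the page whose K-th most recent access is oldest. Pages with fewer
# than K recorded accesses are treated as the oldest candidates.
from collections import deque

class LRUKCache:
    def __init__(self, capacity, k=2):
        self.capacity, self.k = capacity, k
        self.pages = {}            # page id -> page data
        self.history = {}          # page id -> deque of its last k access ticks
        self.tick = 0

    def _touch(self, page_id):
        self.tick += 1
        self.history.setdefault(page_id, deque(maxlen=self.k)).append(self.tick)

    def get(self, page_id):
        if page_id in self.pages:
            self._touch(page_id)
            return self.pages[page_id]
        return None

    def put(self, page_id, data):
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            def kth_recent_tick(pid):
                hist = self.history[pid]
                # Oldest of the last k accesses; -inf if fewer than k accesses.
                return hist[0] if len(hist) == self.k else float("-inf")
            victim = min(self.pages, key=kth_recent_tick)   # evict the "coldest" page
            del self.pages[victim]
            del self.history[victim]
        self.pages[page_id] = data
        self._touch(page_id)

cache = LRUKCache(capacity=2, k=2)
cache.put("page-1", "node records"); cache.put("page-2", "relationship records")
cache.get("page-1"); cache.get("page-1")
cache.put("page-3", "property records")   # evicts page-2, which was touched less
print(list(cache.pages))                  # ['page-1', 'page-3']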
JanusGraph
JanusGraph itself focuses neither on storage nor on analysis; instead, it implements interfaces between a graph database engine and a variety of index and storage engines, and uses these interfaces to provide data storage and indexing. The main purpose of JanusGraph is to support graph data modeling on top of existing infrastructure and to optimize the details of graph data serialization, graph data modeling, and graph query execution. JanusGraph provides modular data persistence, data indexing, and client interfaces, which makes it easier to apply the graph data model in real development.
In addition, JanusGraph supports Cassandra, HBase, and BerkeleyDB as storage engines, and supports data indexing with Elasticsearch, Solr, and Lucene.
In terms of application, you can interact with JanusGraph in two ways:
Embed JanusGraph in the application for querying and caching; these data interactions all happen within the same JVM, though the data itself may reside locally or elsewhere.
Run JanusGraph as a service: the client is separated from the server, and the client submits Gremlin query statements to the server, which performs the corresponding data processing.
Nebula Graph
Here is a brief introduction to the system design of Nebula Graph.
Using KV pairs for graph data processing
Nebula Graph uses vertexID + TagID as the key and distributes the related in-key and out-key data across partitions. This design ensures high availability on large-scale clusters, and the distributed partitioning and sharding also increase Nebula Graph's throughput and fault tolerance.
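A schematic sketch of this keying idea is shown below. The field layout and key strings are illustrative only and do not reproduce Nebula Graph's exact on-disk format: vertex data is keyed by vertexID plus TagID, and each edge is written twice, as an out-key grouped with its source vertex and an in-key grouped with its destination vertex, so both directions can be scanned locally.

# Schematic sketch of storing graph data as KV pairs (layout is illustrative,
# not Nebula Graph's exact on-disk format).
def vertex_key(vertex_id, tag_id):
    return f"v:{vertex_id}:{tag_id}"

def out_edge_key(src_id, edge_type, dst_id):
    return f"e_out:{src_id}:{edge_type}:{dst_id}"   # grouped with the source vertex

def in_edge_key(dst_id, edge_type, src_id):
    return f"e_in:{dst_id}:{edge_type}:{src_id}"    # grouped with the destination vertex

store = {}
store[vertex_key(2020031601, "person")] = {"name": "Jeff"}
store[vertex_key(2020031602, "person")] = {"name": "A"}
store[out_edge_key(2020031601, "manage", 2020031602)] = {"level_s": "0", "level_end": "1"}
store[in_edge_key(2020031602, "manage", 2020031601)] = {"level_s": "0", "level_end": "1"}

# All outgoing "manage" edges of vertex 2020031601 can be found with a prefix scan:
prefix = "e_out:2020031601:manage:"
print([k for k in store if k.startswith(prefix)])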
Shared-nothing distributed storage layer
Storage Service uses a shared-nothing distributed architecture, and each storage node hosts multiple local KV storage instances as the physical storage. Nebula uses the majority-consensus protocol Raft to ensure consistency among these KV stores (Raft was chosen because it is simpler than Paxos). On top of the KV store sits the graph semantic layer, which translates graph operations into lower-level KV operations.
Graph data (vertices and edges) are distributed across partitions by hashing. The hash function used is very straightforward: vertex_id modulo the number of partitions. In Nebula Graph a partition is a virtual data set; partitions are spread over all the storage nodes, and the distribution information is stored in Meta Service (so every storage node and compute node can obtain it).
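The placement rule described above can be sketched in a couple of lines. This is a simplified illustration (the partition count and IDs are made up; in the real system the partition count and partition-to-host mapping come from Meta Service):

# Simplified sketch of hash placement: a vertex goes to the partition given by
# vertex_id modulo the number of partitions, which any node can compute locally
# once it knows the partition count from Meta Service.
NUM_PARTITIONS = 10

def partition_of(vertex_id):
    return vertex_id % NUM_PARTITIONS

for vid in (2020031601, 2020031602, 2020031603, 2020031604):
    print(vid, "-> partition", partition_of(vid))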
Stateless computing layer
Each compute node runs a stateless query computing engine, and the nodes do not communicate with each other. A compute node only reads meta information from Meta Service and interacts with Storage Service. This design makes it easy to manage compute-layer clusters or deploy them on the cloud with K8s.
The computing layer supports two forms of load balancing. The most common is to put a load balancer in front of the computing layer. The second is to configure the IP addresses of all compute nodes in the client, so that the client randomly picks a compute node to connect to.
Each query computing engine receives requests from clients, parses the query statement, generates an abstract syntax tree (AST), passes the AST to the planner and optimizer, and finally hands the plan to the executor for execution.
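The stages can be pictured as a simple pipeline. The sketch below is schematic only; every function name and data structure in it is made up for illustration and is not Nebula Graph's code:

# Schematic sketch of a stateless query pipeline: parse -> plan/optimize -> execute.
def parse(query_text):
    return {"ast": query_text.split()}               # pretend AST

def plan_and_optimize(ast):
    return {"steps": ["scan", "filter", "project"], "ast": ast}

def execute(plan, storage_rows):
    # Only the executor touches the storage service (a plain list here).
    return [row for row in storage_rows if "manage" in row]

def handle_request(query_text, storage_rows):
    ast = parse(query_text)                           # 1. parse into an AST
    plan = plan_and_optimize(ast)                     # 2. planner and optimizer
    return execute(plan, storage_rows)                # 3. executor reads storage

storage_rows = ["Jeff manage A", "Jeff manage B", "A likes music"]
print(handle_request("GO FROM 1 OVER manage", storage_rows))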
Graph databases are the trend today
Today, graph databases are receiving more and more attention from analysts and consulting firms.
"Graph analysis is possibly the single most effective competitive differentiator for organizations pursuing data-driven operations and decisions after the design of data capture." - Gartner
"Graph analysis is the true killer app for Big Data."-Forrester
Meanwhile, graph databases show the fastest-rising trend in the DB-Engines ranking, which shows how urgent the demand is:
Graph database practice: not just social networks
Engineering practice of the Netflix cloud database
Netflix uses JanusGraph + Cassandra + ElasticSearch as its graph database architecture, which they use to manage digital assets.
Nodes represent digital products such as movies, documentaries, etc., and the relationship between these products is the edge between nodes.
Netflix currently has about 200 million nodes, more than 70 digital products, and hundreds of queries and data updates every minute.
In addition, Netflix applies the graph database to authorization, distributed tracing, and visualization workflows, for example visualizing Git commits and Jenkins deployments.
Technical iteration of Adobe
Generally speaking, new technologies are not favored by large companies at first, and graph databases are no exception. Large companies carry many legacy projects, and the user base and reliability requirements of those projects make them reluctant to risk new technology on stable products. Adobe set an example of technology iteration here, replacing an old NoSQL database, Cassandra, with the Neo4j graph database.
The revamped system is Behance, a content-sharing social platform released by Adobe in 2015. It has about 10 million users who can share their creations with millions of people.
Such a huge legacy system was originally built on Cassandra and MongoDB, and its historical baggage left many performance bottlenecks that had to be resolved.
The slow read performance of MongoDB and Cassandra stemmed mainly from the original system's fan-out design pattern: content published by each followed user is distributed separately into every reader's feed, which also introduces large delays in the network architecture. In addition, operating and maintaining Cassandra itself required a large technical team, which was another big problem.
To build a flexible, efficient, and stable system for feed delivery while minimizing the amount of stored data, Adobe decided to migrate the original Cassandra database to the Neo4j graph database.
In the Neo4j graph database, so-called tiered relationships are used to represent the relationships between users, and these edges can carry different access states, for example content visible only to certain users, or only to followers.
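To illustrate why the fan-out pattern is expensive, here is a tiny sketch (illustrative Python with made-up names, not Behance's code) contrasting fan-out on write with assembling a feed at read time from stored relationships:

# Illustrative contrast: fan-out on write copies each new post into every
# follower's feed, while the relationship-based read pulls posts on demand.
followers = {"artist": ["reader1", "reader2", "reader3"]}
posts_by_author = {"artist": []}
feeds = {u: [] for u in ("reader1", "reader2", "reader3")}

def publish_fanout(author, post):
    # Fan-out on write: write cost and storage grow with the follower count.
    posts_by_author[author].append(post)
    for follower in followers[author]:
        feeds[follower].append(post)

def read_feed_from_relationships(user):
    # Relationship-based read: follow the "follows" edge at read time and pull
    # posts from the authors; nothing is duplicated per follower.
    authors = [a for a, fs in followers.items() if user in fs]
    return [p for a in authors for p in posts_by_author[a]]

publish_fanout("artist", "new illustration")
print(feeds["reader2"])                          # ['new illustration'] (copied on write)
print(read_feed_from_relationships("reader3"))   # ['new illustration'] (pulled on read)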
The data model is shown in the figure
Using this data model, together with a leader-follower architecture to optimize reads and writes, the platform achieved significant performance improvements:
The duration of operation and maintenance requirements has decreased by 300% after the use of Neo4j.
Storage requirements have been reduced by 1000 times, with Neo4j requiring only 50 gigabytes of data and Cassandra requiring 50TB.
Only 3 service instances are needed to support the smooth operation of the entire server, compared with 48 before.
The graph database itself provides higher scalability.
In today's big data era, the use of graph database can achieve a huge performance improvement on the original architecture at a small cost. Graph database can not only play a huge promoting role in the fields of 5G, AI and Internet of things, but also can be used to reconstruct the original legacy system.
Although different graph databases may have different underlying implementations, they all fully support the use of a graph to build a data model so that different components can relate to each other. From our previous discussion, this change in the level of data model will greatly simplify the problems faced in many daily data systems, increase the throughput of the system and reduce the requirements of operation and maintenance.
After reading the above, have you got a handle on how to look at graph databases? If you want to learn more, you are welcome to keep following related content. Thank you for reading!