2025-03-26 Update From: SLTechnology News&Howtos
Shulou (Shulou.com) 05/31 Report--
In this issue, the editor walks you through the one-stop graph computing platform GraphScope. The article is rich in content and analyzes the topic from a professional point of view; we hope you get something out of it.
I. What Is Graph Computing?
A graph models a set of objects (vertices) and their relationships (edges), and can intuitively and naturally represent all kinds of real-world entities and the relationships between them. In big data scenarios, social networks, transaction data, knowledge graphs, transportation and communication networks, and supply chain and logistics planning are all typical examples of graph modeling. Figure 1 shows Alibaba's graph data in an e-commerce scenario, with various types of vertices (consumers, sellers, items, and devices) and edges (representing purchases, views, comments, etc.). In addition, each vertex carries rich attribute information.
Graph data in real scenarios often contains billions of vertices and trillions of edges. Beyond sheer scale, such graphs also update continuously and rapidly, sometimes with nearly a million updates per second. As applications of graph data have kept growing in recent years, exploring the relationships inside graph data and computing over it have attracted more and more attention. By objective, graph computing can be roughly divided into three types of tasks: interactive query, graph analytics, and graph-based machine learning.
1. Interactive queries on graphs
In graph computing applications, analysts usually need to explore graph data interactively in order to locate problems quickly and dig into deeper information. For example, the (simplified) graph model in Figure 2 (left) can be used for financial anti-fraud (credit card cash-out) detection. Using a forged identity, a "criminal" obtains short-term credit from a bank (vertex 4). He attempts to cash out by making a fake purchase (edge 2->3) with the help of a merchant (vertex 3). Once the payment arrives from the bank (edge 4->3), the merchant routes the money back (via edges 3->1 and 1->2) to the "criminal" through multiple accounts under its control. This pattern eventually forms a closed loop on the graph (2->3->1->2). In real scenarios, the online graph data may contain billions of vertices (e.g., users) and hundreds of billions to trillions of edges (e.g., payment transactions), and the whole fraud process may involve dynamic transaction chains with various constraints among many entities, so complex real-time interactive analysis is needed to identify such patterns.
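At toy scale, the closed-loop pattern above can be illustrated with a short sketch. The following plain-Python snippet is illustrative only, not GraphScope code: the vertex ids and edges follow the example, while the function and variable names are invented.

```python
# Toy transaction graph from the example: fake purchase 2->3, money routed
# back via 3->1 and 1->2; the bank's payment 4->3 is present but not in the loop.
transactions = {2: [3], 3: [1], 1: [2], 4: [3]}

def find_cycles_from(adj, start, max_len=4):
    """Return all simple cycles starting and ending at `start`, up to max_len hops."""
    cycles = []

    def dfs(node, path):
        for nxt in adj.get(node, []):
            if nxt == start and len(path) > 1:
                cycles.append(path + [start])
            elif nxt not in path and len(path) < max_len:
                dfs(nxt, path + [nxt])

    dfs(start, [start])
    return cycles

fraud_loops = find_cycles_from(transactions, 2)
# fraud_loops == [[2, 3, 1, 2]]: the closed loop 2 -> 3 -> 1 -> 2
```

A production system would express such constrained-cycle queries as interactive traversals (e.g., in Gremlin) over graphs many orders of magnitude larger.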
2. Graph analysis
Research on graph analytics has been going on for decades and has produced many algorithms. Typical graph analytics algorithms include classic graph algorithms (e.g., PageRank, shortest paths, and maximum flow), community detection algorithms (e.g., maximum clique, connected component computation, Louvain, and label propagation), and graph mining algorithms (e.g., frequent subgraph mining and graph pattern matching). Because of the diversity of graph analytics algorithms and the complexity of distributed computing, distributed graph analytics algorithms usually need to follow a certain programming model. Current programming models include the vertex-centric model ("think like a vertex"), matrix-based models, and subgraph-centric models. On top of these models, a variety of graph analytics systems have emerged, such as Apache Giraph, Pregel, PowerGraph, Spark GraphX, and GRAPE.
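To make the vertex-centric model concrete, here is a minimal plain-Python sketch of Pregel-style PageRank: in each superstep every vertex emits rank/out_degree along its out-edges, then folds the incoming messages into its new rank. This is an illustration of the programming model, not the API of any system named above; it assumes every vertex has at least one out-edge (no dangling-vertex handling).

```python
def pagerank_vertex_centric(out_edges, damping=0.85, iters=30):
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iters):
        # "scatter" phase: every vertex emits rank / out_degree along its out-edges
        msgs = {v: 0.0 for v in vertices}
        for v in vertices:
            share = rank[v] / len(out_edges[v])
            for u in out_edges[v]:
                msgs[u] += share
        # "gather" phase: every vertex folds incoming messages into its new rank
        rank = {v: (1 - damping) / n + damping * msgs[v] for v in vertices}
    return rank

# tiny example graph: c is cited by both a and b, so it should rank highest
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_vertex_centric(graph)
```

Real systems distribute the vertices across workers and exchange the messages over the network, but the per-vertex compute/gather structure is the same.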
3. Graph-based machine learning
Classic graph embedding techniques, such as Node2Vec and LINE, have been widely used in a variety of machine learning scenarios. The more recently proposed graph neural networks (GNNs) combine the structure and attribute information of a graph with deep learning features. A GNN can learn a low-dimensional representation for any graph structure (e.g., a vertex, an edge, or the whole graph), and the resulting representations can feed many downstream graph-related machine learning tasks such as classification, link prediction, and clustering. Graph learning techniques have shown convincing performance on many graph-related tasks. Unlike traditional machine learning tasks, graph learning involves both graph operations and neural network operations (see Figure 2, right): each vertex selects its neighbors with graph operations, then aggregates its neighbors' features with neural network operations.
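The select-then-aggregate pattern just described can be sketched in a few lines of plain Python. The data and names below are hypothetical; a real GNN layer would apply learned weights and a nonlinearity after the aggregation, as noted in the comment.

```python
def aggregate_neighbors(features, neighbors, vertex):
    """Concatenate a vertex's own feature with the mean of its neighbors' features."""
    own = features[vertex]
    nbrs = neighbors[vertex]
    if not nbrs:
        mean = [0.0] * len(own)
    else:
        mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                for i in range(len(own))]
    # a real GNN would now apply a learned linear layer plus a nonlinearity
    return own + mean

# toy 2-dimensional features and adjacency lists
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
h0 = aggregate_neighbors(features, neighbors, 0)
# h0 == [1.0, 0.0, 0.5, 1.0]: own feature followed by the neighbor mean
```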
II. Graph Computing: the Cornerstone of Next-Generation Artificial Intelligence
Not only at Alibaba: graph data and graph computing technology have been a focus of both academia and industry in recent years. In particular, over the past decade the performance of graph computing systems has improved by a factor of 100, and systems continue to grow more efficient, making it possible to accelerate AI and big data tasks with graph computing. Indeed, graphs can naturally express many complex types of data and provide abstractions for common machine learning models. Compared with dense tensors, graphs offer richer semantics and more comprehensive optimization opportunities. Moreover, a graph is a natural representation of sparse, high-dimensional data, and a growing body of work on graph convolutional networks (GCNs) and graph neural networks (GNNs) has shown that graph computing is an effective complement to machine learning and will play an increasingly important role in interpretable results and deep causal reasoning.
It can be predicted that graph computing will play an important role in various applications of next-generation artificial intelligence, including anti-fraud, intelligent logistics, urban brain, bioinformatics, public safety, public health, urban planning, anti-money laundering, infrastructure, recommendation systems, financial technology and supply chain.
III. The Current State of Graph Computing
After years of development, a variety of systems and tools have emerged for different graph computing needs. For interactive query, there are graph databases such as Neo4j, ArangoDB, and OrientDB, as well as distributed systems and services such as JanusGraph, Amazon Neptune, and Azure Cosmos DB; for graph analytics, there are systems such as Pregel, Apache Giraph, Spark GraphX, and PowerGraph; for graph learning, there are DGL, PyTorch Geometric, and others. Nevertheless, in the face of rich graph data and diverse graph scenarios, effectively using graph computing to improve business outcomes remains a huge challenge:
Real-world graph computing scenarios are diverse and usually very complex, involving many types of graph computation. Existing systems are mostly designed for specific types of graph computing tasks, so users must break complex tasks down into multiple jobs spanning many systems, incurring substantial overhead between systems for integration, IO, format conversion, network, and storage.
Developing large-scale graph computing applications is difficult. Users usually start with easy-to-use tools (such as NetworkX in Python, or TinkerPop) on small-scale graph data on a single machine. However, it is extremely hard for ordinary users to extend their single-machine solutions to parallel environments that can handle large graphs. Existing distributed systems for large-scale graphs follow different programming models and lack the rich, ready-to-use algorithm library of a single-machine package such as NetworkX. This makes the barrier to entry for distributed graph computing too high.
The scale and efficiency of processing large graphs are still limited. For example, due to the high complexity of traversal patterns, existing interactive graph query systems cannot execute Gremlin queries in parallel. For graph analytics systems, the traditional vertex-centric programming model makes graph-level optimization techniques unavailable. In addition, many existing systems are largely unoptimized at the compiler level.
Let's look at the limitations of existing systems through a concrete example.
1. Example: paper classification and prediction
The ogbn-mag dataset comes from the Microsoft Academic Graph. It has four types of vertices, representing papers, authors, institutions, and fields of study, and four types of edges expressing the relationships between them: an author "writes" a paper, a paper "cites" another paper, an author "is affiliated with" an institution, and a paper "belongs to" a field of study. This data is naturally modeled as a graph.
Suppose a user wants to run a classification task on the papers published between 2014 and 2020 in this graph, predicting each paper's topic category from its structural position in the citation graph, its own topic features, and cohesion metrics such as k-core and triangle counting. This is in fact a common and meaningful task: by taking both citation relationships and paper topics into account, such predictions can help researchers discover potential collaborations and research hotspots in their field.
Let's decompose this computing task. First, we need to filter the papers and their related vertices and edges by year; then we need to run whole-graph analytics such as k-core and triangle counting on the filtered graph; finally, these two metrics, together with the original features on the graph, are fed into a machine learning framework for classification training and prediction. Existing systems cannot solve this problem well end to end; we can only run it by stitching multiple systems into a pipeline. The task appears solved, but many problems hide behind this pipelined solution: the systems are independent and isolated from one another, so intermediate data must repeatedly be written out and reloaded to move between them; graph analytics programs have no declarative language or fixed paradigm; the scale of the graph limits the efficiency of the machine learning framework; and so on. These are the problems we frequently encounter in real-world computing scenarios. To sum up:
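On a toy scale, the three stages of this pipeline can be sketched in plain Python. The dataset below is invented; a real pipeline would split exactly these stages across a graph store, an analytics system, and an ML framework, which is the fragmentation being described.

```python
# stage 1: filter papers and citation edges by publication year
def filter_by_year(papers, edges, lo, hi):
    keep = {p for p, year in papers.items() if lo <= year <= hi}
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]

# stage 2a: k-core by iteratively peeling vertices of degree < k
def k_core(nodes, edges, k):
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return set(adj)

# stage 2b: per-vertex triangle counts (each triangle at v counted once)
def triangle_counts(nodes, edges):
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    return {v: sum(len(adj[v] & adj[u]) for u in adj[v]) // 2 for v in adj}

# stage 3 would hand these columns to an ML framework as training features
papers = {1: 2015, 2: 2016, 3: 2018, 4: 2010}   # paper id -> year (toy data)
cites = [(1, 2), (2, 3), (1, 3), (3, 4)]
keep, sub_edges = filter_by_year(papers, cites, 2014, 2020)
core = k_core(keep, sub_edges, k=2)
tc = triangle_counts(keep, sub_edges)
features = {p: {"tc": tc[p], "in_2core": p in core} for p in keep}
```

Each stage here uses its own in-memory representation, mirroring the format-conversion and data-transfer cost that shows up when the stages live in separate systems.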
The problems graph computing addresses are complex, the computation patterns are diverse, and the solutions are fragmented.
Learning and developing graph computing is difficult and costly, with a high barrier to entry.
Graphs are large, data volumes are huge, computation is complex, and efficiency is low.
To solve these problems, we designed and developed a one-stop open source graph computing system: GraphScope.
IV. What Is GraphScope?
GraphScope is a one-stop graph computing platform developed and open-sourced by the Intelligent Computing Laboratory of Alibaba's DAMO Academy. Backed by Alibaba's massive data and rich scenarios and by DAMO Academy's research, GraphScope aims to provide a one-stop, efficient solution to the real-world graph computing challenges described above.
GraphScope provides a Python client that connects easily to upstream and downstream workflows. It is one-stop, easy to develop with, and extremely fast: it features efficient cross-engine memory management, is the first in the industry to support distributed compilation and optimization of Gremlin, supports automatic parallelization of algorithms and automatic incremental processing of dynamic graph updates, and delivers the performance that enterprise scenarios demand. In applications inside and outside Alibaba, GraphScope has been shown to create significant new business value in many key Internet domains (such as risk control, e-commerce recommendation, advertising, network security, and knowledge graphs).
GraphScope incorporates many of DAMO Academy's research achievements. Its core technologies have won the SIGMOD 2017 Best Paper Award, the VLDB 2017 Best Demo Award, a VLDB 2020 Best Paper nomination, and the SAIL award of the World Artificial Intelligence Conference. The paper on GraphScope's interactive query engine has also been accepted by NSDI 2021 and will be published soon. More than ten other research results around GraphScope have been published in top conferences and journals in the field, such as TODS, SIGMOD, VLDB, and KDD.
1. Architecture introduction
Figure 5: GraphScope system architecture diagram
At the bottom of GraphScope is vineyard [1], a distributed in-memory data management system. Vineyard is itself an open source project. It provides an efficient and rich IO interface for interacting with underlying file systems; it provides efficient, high-level data abstractions (including, but not limited to, graphs, tensors, and vectors); it manages data partitioning, metadata, and the like; and it offers native zero-copy data access to upper-level applications. This is what underpins GraphScope's one-stop capability: across engines, graph data is partitioned and managed by vineyard.
In the middle is the engine layer, composed of the interactive query engine GIE, the graph analytics engine GAE, and the graph learning engine GLE, which we describe in detail in the following sections.
At the top are development tools and algorithm libraries. GraphScope ships with many commonly used analytics algorithms, including connectivity algorithms, community detection algorithms, and numerical algorithms such as PageRank and centrality measures, and its algorithm packages will continue to expand, aiming to provide NetworkX-compatible analysis capabilities on very large graphs. It also provides a rich graph learning algorithm package with built-in support for GraphSAGE, DeepWalk, LINE, Node2Vec, and other algorithms.
2. Solving the example: paper classification and prediction
With the one-stop computing platform GraphScope, we can solve the problem in the previous example in a simpler way.
GraphScope provides a Python client that allows data scientists to do all the work related to graph computing in an environment that they are familiar with. After opening Python, we first need to establish a GraphScope session.
import graphscope
from graphscope.dataset.ogbn_mag import load_ogbn_mag

sess = graphscope.session()
g = load_ogbn_mag(sess, "/testingdata/ogbn_mag/")
In the above code, we created a session for GraphScope and loaded the graph data.
GraphScope is designed to be cloud native. A session corresponds to a set of Kubernetes (k8s) resources and is responsible for requesting and managing all resources within that session. Specifically, behind the user's single line of code, the session first requests a pod for a back-end master entry point, the Coordinator. The Coordinator handles all communication with the Python client and, after completing its initialization, launches a set of engine pods. Each pod in this group runs a vineyard instance, together forming the distributed memory management layer; each pod also hosts the three engines GIE, GAE, and GLE, whose start and stop are managed on demand by the Coordinator. Once this group of pods is up, has established stable connections with the Coordinator, and has passed health checks, the Coordinator returns the status to the client, telling the user that the session is ready and that resources are available to load graphs and start computing.
interactive = sess.gremlin(g)

# count the number of papers two authors (with id 2 and 4307) have co-authored
papers = interactive.execute(
    "g.V().has('author', 'id', 2).out('writes')"
    ".where(__.in('writes').has('id', 4307)).count()"
).one()
First, we create an interactive query object, interactive, on graph g; this launches a set of GIE interactive query engines in the engine pods. Then comes a standard Gremlin query in which the user counts the papers two specific authors have co-authored. This Gremlin statement is sent to the GIE engine to be decomposed and executed.
The GIE engine consists of core components including a parallelizing compiler, memory and scheduling management, an operator runtime, adaptive traversal strategies, and a distributed dataflow engine. When an interactive query arrives, the statement is first split by the compiler and compiled into multiple operators, which are then driven and executed under a distributed dataflow model. In this process, each computing node that holds a partition of the data runs a copy of the dataflow, processes its partition in parallel, and exchanges data as needed, thereby executing the Gremlin query in parallel.
Under Gremlin's complex syntax, the traversal strategy is critical: it affects the parallelism of queries, and its choice directly determines resource consumption and query performance. Pure BFS or DFS alone cannot meet real-world demands; the optimal traversal strategy often has to be chosen and adjusted dynamically according to the specific data and query. The GIE engine provides adaptive traversal strategy configuration, selecting the strategy based on the query data, the decomposed operators, and a cost model to achieve highly efficient operator execution.
# extract a subgraph of publications within a time range
sub_graph = interactive.subgraph(
    "g.V().has('year', inside(2014, 2020)).outE('cites')")

# project the subgraph to a simple graph
simple_g = sub_graph.project_to_simple(v_label="paper", e_label="cites")

ret1 = graphscope.k_core(simple_g, k=5)
ret2 = graphscope.triangles(simple_g)

# add the results as new columns to the citation graph
sub_graph = sub_graph.add_column(ret1, {"kcore": "r"})
sub_graph = sub_graph.add_column(ret2, {"tc": "r"})
After a series of interactive queries, the user starts a graph analytics task with the statements above.
First, the subgraph operator extracts a subgraph from the original graph according to a filter condition. Behind this operation, the interactive engine GIE executes a query and writes the resulting graph into vineyard.
Then, on the new graph, the user projects out the vertices labeled paper and the cites edges between them, producing a homogeneous simple graph, and runs whole-graph analytics on it by calling GAE's built-in k-core and triangle counting algorithms. Once the results are produced, they are added back onto the graph as vertex attributes. Here, thanks to vineyard's metadata management and high-level data abstractions, the new sub_graph is generated by adding a column to the original graph, with no need to rebuild all the data of the whole graph.
The core of the GAE engine inherits from the GRAPE system, which won the SIGMOD 2017 Best Paper Award [2]. It consists of a high-performance runtime, automatic parallelization components, and multi-language SDKs. The example above uses GAE's built-in algorithms, but GAE also lets users easily write their own algorithms and plug them in. Users write algorithms in the subgraph-centric PIE programming model, or reuse existing sequential graph algorithms, without worrying about distributed details; GAE parallelizes them automatically, greatly lowering the barrier to distributed graph computing. Currently GAE lets users write their algorithm logic in C++ or Python (with Java support planned) and run it plug-and-play in a distributed environment. GAE's high-performance runtime is based on MPI, with careful optimization of communication, data layout, and hardware features for maximum performance.
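The division of labor in the PIE model (PEval for partial evaluation within a fragment, IncEval for incremental evaluation driven by incoming messages, Assemble for collecting results) can be imitated at toy scale in plain Python. This is a hedged sketch of the model's structure using single-source shortest paths, not the GAE SDK: the fragment layout and all names are invented.

```python
INF = float("inf")

def peval(frag, owned, src):
    """PEval: run the sequential algorithm inside one fragment."""
    dist = {v: INF for v in owned}
    msgs = inc_eval(frag, owned, dist, {src: 0} if src in owned else {})
    return dist, msgs

def inc_eval(frag, owned, dist, updates):
    """IncEval: relax from updated vertices; emit messages for remote vertices."""
    out_msgs, work = {}, []
    for v, d in updates.items():
        if d < dist[v]:
            dist[v] = d
            work.append(v)
    while work:
        v = work.pop()
        for u, w in frag.get(v, []):
            nd = dist[v] + w
            if u in owned:
                if nd < dist[u]:
                    dist[u] = nd
                    work.append(u)
            elif nd < out_msgs.get(u, INF):
                out_msgs[u] = nd
    return out_msgs

# Two fragments of a 4-vertex graph; each maps an owned vertex to (neighbor, weight).
frags = {"A": {1: [(2, 1), (3, 4)], 2: [(3, 1)]}, "B": {3: [(4, 1)]}}
owned = {"A": {1, 2}, "B": {3, 4}}

def owner(v):
    return next(f for f, vs in owned.items() if v in vs)

dist, pending = {}, {}
for f in frags:                                   # round 0: PEval on every fragment
    dist[f], msgs = peval(frags[f], owned[f], src=1)
    for v, d in msgs.items():
        pending[v] = min(d, pending.get(v, INF))
while pending:                                    # supersteps: route messages, IncEval
    routed = {}
    for v, d in pending.items():
        routed.setdefault(owner(v), {})[v] = d
    pending = {}
    for f, ups in routed.items():
        for v, d in inc_eval(frags[f], owned[f], dist[f], ups).items():
            pending[v] = min(d, pending.get(v, INF))
result = {}                                       # Assemble: merge fragment results
for f in frags:
    result.update(dist[f])
```

The appeal of the model is visible even here: peval and inc_eval are ordinary sequential relaxation code, and the driver loop, which a system like GAE provides, handles all the cross-fragment coordination.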
# define the features for learning: the 128-dimensional raw feature of each
# paper, plus the kcore and tc columns computed above
paper_features = []
for i in range(128):
    paper_features.append("feat_" + str(i))
paper_features.append("kcore")
paper_features.append("tc")
Next, we use the graph learning engine to classify papers. First, we configure the 128-dimensional features of the paper vertices, together with the kcore and tc attributes computed in the previous step, as training features. Then we launch the graph learning engine GLE from the session. When building the learning graph lg in GLE, we configure the graph data and feature attributes, specify which edge types to use, and split the vertex set into training, validation, and test sets.
from graphscope.learning.examples import GCN
from graphscope.learning.graphlearn.python.model.tf.trainer import LocalTFTrainer
from graphscope.learning.graphlearn.python.model.tf.optimizer import get_tf_optimizer

# supervised GCN
def train_and_test(config, graph):
    def model_fn():
        return GCN(graph, config["class_num"], ...)
    trainer = LocalTFTrainer(model_fn, epoch=config["epoch"], ...)
    trainer.train_and_evaluate()

config = {...}
train_and_test(config, lg)
Then, with the code above, we select a model and configure a few training parameters, and we can very conveniently use GLE to start the graph classification task.
The GLE engine consists of two parts, Graph and Tensor, each composed of various operators. The Graph part bridges graph data and deep learning, handling batch iteration, sampling, negative sampling, and so on, and supports both homogeneous and heterogeneous graphs. The Tensor part is composed of deep learning operators. At run time, a graph learning task is decomposed into these operators, which are then executed in a distributed fashion. To further optimize sampling performance, GLE caches remote neighbors, frequently accessed vertices, attribute indexes, and the like, to speed up lookups of vertices and their attributes in each partition. GLE uses an asynchronous execution engine with support for heterogeneous hardware, which lets it effectively overlap large numbers of concurrent operations such as I/O, sampling, and tensor computation. GLE abstracts heterogeneous compute hardware into resource pools (e.g., CPU thread pools and GPU stream pools) and cooperatively schedules fine-grained concurrent tasks.
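One of the sampling optimizations mentioned above, caching frequently fetched neighbor lists, can be illustrated with a small plain-Python sketch. The class, the store, and all names are hypothetical, not the GLE API; the point is only that a repeated sampling request for the same vertex avoids a second "remote" fetch.

```python
import random

class CachingSampler:
    """Sample up to k neighbors per vertex, caching fetched neighbor lists."""

    def __init__(self, remote_fetch):
        self._fetch = remote_fetch   # callable: vertex -> list of neighbor ids
        self._cache = {}
        self.misses = 0              # number of simulated remote fetches

    def sample(self, vertex, k, rng=random):
        if vertex not in self._cache:
            self._cache[vertex] = self._fetch(vertex)   # simulated remote I/O
            self.misses += 1
        nbrs = self._cache[vertex]
        if len(nbrs) <= k:
            return list(nbrs)
        return rng.sample(nbrs, k)

# toy "remote" neighbor store
store = {0: [1, 2, 3], 1: [0]}
sampler = CachingSampler(store.__getitem__)
first = sampler.sample(0, 2)
second = sampler.sample(0, 2)   # served from cache; only one "remote" fetch occurred
```

A real engine would additionally bound the cache, prioritize hot vertices, and overlap the remote fetches asynchronously with tensor computation, as the text describes.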
V. Performance
GraphScope not only makes graph computing easy to use but also pushes performance to the limits that enterprise scenarios require. We used the LDBC benchmarks to evaluate and compare GraphScope's performance.
As shown in Figure 6, on the LDBC SNB interactive query benchmark, GraphScope deployed on a single node is more than an order of magnitude faster than the open source system JanusGraph; in distributed deployments, GraphScope's interactive queries achieve near-linear scalability.
On the LDBC Graphalytics benchmark, GraphScope leads PowerGraph and other state-of-the-art systems on almost all combinations of algorithms and datasets, and on some of these combinations it outperforms the other platforms by at least five times.
The above is the editor's walkthrough of the one-stop graph computing platform GraphScope. If you happen to have similar questions, the analysis above may help you understand them. If you want to learn more, you are welcome to follow the industry information channel.