
How to use Spark GraphX

2025-01-16 Update From: SLTechnology News & Howtos


Shulou (Shulou.com) 05/31 report

This article introduces how to use Spark GraphX. The situations covered below come up often in real-world cases; I hope you read it carefully and come away with something useful.

Introduction to GraphX

When Spark was young, version 0.5 already shipped a small Bagel module that provided Pregel-like functionality. That version was primitive, weak in both performance and features, and essentially an experimental product. By version 0.8, given the industry's growing demand for distributed graph computing, Spark started an independent branch, graphx-branch, for a standalone graph computing module; drawing on GraphLab, the design and development of GraphX began. In version 0.9 the module was officially merged into the main branch; although only an alpha release, it was ready for trial use, and little Bagel quietly left the stage. With version 1.0, GraphX officially went into production.

It is worth noting that GraphX is still developing rapidly. From the 0.8 branch to 0.9 and then 1.0, each version brought substantial code improvements and refactoring; by our observation, without changing any code logic or runtime environment, merely upgrading the version, switching interfaces and recompiling yields a 10-20% performance improvement per release. Although a performance gap with GraphLab remains, Spark's end-to-end integrated pipeline, the community's enthusiasm, and the pace of improvement make GraphX strongly competitive.

Distributed graph computing

Before formally introducing GraphX, let's look at distributed graph computing frameworks in general. Simply put, the purpose of such a framework is to wrap the operations on huge graphs behind simple interfaces, making complex concerns such as distributed storage and parallel computation transparent to the upper layers. Engineers working on complex networks and graph algorithms can then focus on the design and use of graph models without worrying about the underlying distributed details. Achieving this requires solving two general problems.

1. Graph storage mode

Generally speaking, storing a huge graph can be done in two ways: edge cut and vertex cut. GraphLab 2.0, released in 2013, changed its storage from edge cut to vertex cut and achieved a significant performance improvement; the vertex-cut approach is now widely accepted and used across the industry.

Edge cut (Edge Cut)

Each vertex is stored once, but some edges are broken and their two halves assigned to different machines. The advantage is saving storage space; the disadvantage is that for edge-based graph computations, any edge whose two vertices sit on different machines requires cross-machine data transfer, generating heavy internal network traffic.

Vertex cut (Vertex Cut)

Each edge is stored only once, on a single machine. Vertices with many neighbors are replicated to multiple machines, which increases storage overhead and raises data synchronization issues. The advantage is that intranet traffic can be greatly reduced.

Originally the two methods each had their pros and cons, but vertex cut has now gained the upper hand, and the major distributed graph computing frameworks have switched their underlying storage to it. There are two main reasons:

With falling disk prices, storage space is no longer the bottleneck, while intranet communication has seen no comparable breakthrough. In cluster computing, intranet bandwidth is precious, and time is more precious than disk; this resembles the common space-for-time strategy.

In today's application scenarios, the vast majority of networks are "scale-free": they follow a power-law distribution, so the number of neighbors varies enormously between vertices. Edge cut would assign most of the edges attached to those high-degree vertices to different machines, and such a data distribution stretches internal network bandwidth even thinner. Edge-cut storage is therefore gradually being abandoned.
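The contrast between the two cuts can be sketched in a few lines of plain Python. This is an illustrative toy, not GraphX code: the machine-assignment rules (vertex id modulo machine count for edge cut, a hash of the edge for vertex cut) are hypothetical stand-ins for a real partitioner.

```python
def edge_cut(edges, num_machines):
    """Edge cut: each vertex lives on exactly one machine; an edge whose
    endpoints land on different machines is 'cut' and forces
    cross-machine communication on every edge-based computation."""
    home = lambda v: v % num_machines
    return [e for e in edges if home(e[0]) != home(e[1])]

def vertex_cut(edges, num_machines):
    """Vertex cut: each edge lives on exactly one machine; a vertex whose
    edges land on several machines is replicated to each of them."""
    placement = {e: hash(e) % num_machines for e in edges}
    replicas = {}
    for (src, dst), m in placement.items():
        replicas.setdefault(src, set()).add(m)
        replicas.setdefault(dst, set()).add(m)
    # storage overhead = vertex copies beyond the one master per vertex
    extra_copies = sum(len(ms) for ms in replicas.values()) - len(replicas)
    return placement, extra_copies

edges = [(0, 1), (0, 2), (0, 3), (2, 3)]
print(len(edge_cut(edges, 2)))   # 3 of the 4 edges straddle two machines
placement, copies = vertex_cut(edges, 2)
print(copies)                    # replication cost paid to save traffic
```

On this tiny star-like graph the edge cut leaves three of four edges crossing machines, while the vertex cut trades that communication for a few extra vertex copies, which is exactly the space-for-time trade the text describes.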

2. Graph computing model

Current graph computing frameworks basically follow the BSP model. BSP, in full Bulk Synchronous Parallel, was proposed by Leslie Valiant of Harvard University and Bill McColl of Oxford University. In BSP, a computation consists of a series of global supersteps, each made up of three stages: concurrent computation, communication, and barrier synchronization. Completing the synchronization marks the end of one superstep and the beginning of the next.
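The superstep loop can be sketched in plain Python. This is a minimal single-process simulation of the BSP idea, not a distributed implementation; the vertex program here (propagating the maximum value) is an arbitrary example.

```python
def bsp_run(values, neighbors, compute, max_supersteps):
    """values: vertex -> state; neighbors: vertex -> out-neighbors.
    compute maps (state, incoming msgs) -> (new state, outgoing msg)."""
    inbox = {v: [] for v in values}
    for _ in range(max_supersteps):
        outbox = {v: [] for v in values}
        # 1) concurrent computation: every vertex runs the same program
        for v in values:
            values[v], msg = compute(values[v], inbox[v])
            # 2) communication: send the message along out-edges
            for n in neighbors[v]:
                outbox[n].append(msg)
        # 3) barrier synchronization: messages only become visible
        #    at the start of the next superstep
        inbox = outbox
    return values

# toy vertex program: take the max of own value and incoming values
graph = {0: [1], 1: [2], 2: [0]}
vals = {0: 5, 1: 1, 2: 3}
result = bsp_run(vals, graph, lambda s, msgs: (max([s] + msgs),) * 2, 3)
print(result)  # after 3 supersteps every vertex holds the global max: 5
```

Three supersteps suffice for the maximum to travel around the 3-cycle, illustrating how global results emerge from purely local compute plus synchronized message rounds.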

The BSP pattern is very concise. Based on the BSP pattern, there are two mature graph computing models:

Pregel model: "Think Like a Vertex"

In 2010, Google unveiled its three new troikas: Caffeine, Pregel and Dremel, and along with Pregel the BSP model became widely known. The name Pregel is said to commemorate Euler's Seven Bridges of Königsberg problem: the river those seven bridges crossed is called the Pregel.

Drawing on the ideas of MapReduce, Pregel proposed the "Think Like A Vertex" graph computing model: users need not consider the details of parallel and distributed computing, and only implement a vertex update function, which the framework calls while traversing vertices.

Common code templates are as follows:

void Compute(MessageIterator* msgs) {
  // traverse the messages passed in along the vertex's in-edges
  for (; !msgs->Done(); msgs->Next())
    doSomething();
  // compute the new vertex value
  *MutableVertexValue() = ...;
  // send messages along the vertex's out-edges
  SendMessageToAllNeighbors(...);
}

Simple as this model is, its flaw is easy to spot: a vertex with a huge number of neighbors has a correspondingly huge number of messages to process, and in this model they cannot be handled concurrently. For natural graphs following a power-law distribution, this computing model therefore easily hangs or collapses on the super nodes.

GAS model: the neighbor-update model

Compared with the message-passing paradigm of the Pregel model, GraphLab's GAS model leans toward a shared-memory style. It allows a user-defined function to access the entire neighborhood of the current vertex, abstracted into three phases: Gather, Apply and Scatter, hence GAS. Correspondingly, the user implements three independent functions: gather, apply and scatter.

Common code templates are as follows:

// gather data from adjacent vertices and edges
Message gather(Vertex u, Edge uv, Vertex v) {
  Message msg = ...
  return msg
}

// merge function
Message sum(Message left, Message right) {
  return left + right
}

// update the vertex master
void apply(Vertex u, Message sum) {
  u.value = ...
}

// update adjacent edges and neighboring vertices
void scatter(Vertex u, Edge uv, Vertex v) {
  uv.value = ...
  if (|u.delta| > ε) Activate(v)
}

Because the gather/scatter functions take a single edge as their operating granularity, the many adjacent edges of a vertex can be processed by independent gather/scatter calls on their respective workers. This design mainly serves the vertex-cut storage mode, and thereby avoids the problems the Pregel model runs into.
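One GAS iteration can be sketched in plain Python. This is an illustrative toy, not GraphLab or GraphX code; the key property it demonstrates is that gather runs per edge and the merge is associative, so a high-degree vertex's work can be split across workers and combined afterwards.

```python
def gas_step(vertices, edges, gather, merge, apply_fn):
    """vertices: id -> value; edges: list of (src, dst, edge_value)."""
    acc = {}
    # Gather: one call per adjacent edge (parallelizable per edge)
    for src, dst, ev in edges:
        msg = gather(vertices[src], ev, vertices[dst])
        # associative/commutative merge of partial messages
        acc[dst] = merge(acc[dst], msg) if dst in acc else msg
    # Apply: update each vertex master with its merged message
    return {v: apply_fn(val, acc.get(v)) for v, val in vertices.items()}

# toy example: each vertex value becomes the sum of its in-neighbors' values
verts = {"a": 1.0, "b": 2.0, "c": 4.0}
edges = [("a", "c", None), ("b", "c", None), ("c", "a", None)]
out = gas_step(verts, edges,
               gather=lambda u, ev, v: u,
               merge=lambda x, y: x + y,
               apply_fn=lambda old, msg: msg if msg is not None else old)
print(out)  # {'a': 4.0, 'b': 2.0, 'c': 3.0}
```

Note how "c", which has two in-edges, never sees a message list: its two gather results are folded together by merge, which is exactly what lets a vertex-cut framework process a super node's edges on different machines.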

The framework of GraphX

By the time GraphX was designed, vertex cut and GAS were both mature, so GraphX stood on the shoulders of giants from the start, optimizing for these issues in both design and code to find the best balance between functionality and performance.

Every Spark submodule, like Spark itself, has a core abstraction. The core abstraction of GraphX is the Resilient Distributed Property Graph: a directed multigraph with attributes on both vertices and edges. It extends Spark's RDD abstraction and offers both a Table view and a Graph view over a single physical storage; each view has its own operators, giving both flexible operation and execution efficiency.

Like Spark itself, GraphX's code is remarkably concise: the core GraphX code is only about 3,000 lines, and the Pregel model implemented on top of it only about 20. The overall code structure of GraphX is as follows:

Overall the structure is clear: most of the code in the impl package revolves around Partition optimization and implementation. To some degree this shows that vertex-cut storage and the corresponding computational optimizations really are the key difficulties of a graph computing framework.

Design Essentials of GraphX

There are several key points in the underlying design of GraphX:

All operations on the Graph view are ultimately converted into RDD operations on its associated Table view. A graph computation is thus logically equivalent to a series of RDD transformations, so Graph in fact enjoys the three key properties of RDDs: Immutable, Distributed and Fault-Tolerant. The most important is Immutable: logically, every transformation and operation on a graph produces a new graph; physically, GraphX reuses the unchanged vertices and edges to a certain degree, transparently to users.

The physical data shared beneath the two views consists of two RDDs: RDD[VertexPartition] and RDD[EdgePartition]. Vertices and edges are not actually stored as tables of tuples, but by VertexPartition/EdgePartition, each of which holds a fragment of the data in an internal indexed structure to speed up traversal in the different views. This immutable index structure is shared across RDD transformations, reducing computation and storage overhead.

The graph's distributed storage uses the vertex-cut mode, via the partitionBy method, with the partition strategy (PartitionStrategy) specified by the user. The strategy assigns edges to EdgePartitions and vertex masters to VertexPartitions; each EdgePartition also caches Ghost copies of the vertices adjacent to its local edges. Different strategies affect how many Ghost copies must be cached and how evenly edges are spread across EdgePartitions; the best strategy should be chosen according to the structure of the graph. Four strategies are currently available: EdgePartition2D, EdgePartition1D, RandomVertexCut and CanonicalRandomVertexCut. According to our experiments so far, EdgePartition2D works best in most Taobao scenarios.
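The intuition behind the 2D strategy can be sketched in a few lines. This is a simplified, hypothetical rendering of the idea (not the actual GraphX implementation): view the adjacency matrix as a sqrt(P) x sqrt(P) grid of blocks and route each edge to the block addressed by its source row and destination column, so any single vertex can appear in at most 2*sqrt(P) partitions, bounding its Ghost-copy count.

```python
import math

def edge_partition_2d(src, dst, num_parts):
    """Route edge (src, dst) to a block of a sqrt(P) x sqrt(P) grid."""
    side = math.ceil(math.sqrt(num_parts))
    row = src % side          # source vertex picks the row
    col = dst % side          # destination vertex picks the column
    return (row * side + col) % num_parts

# a vertex's edges can only land in its row of blocks (as a source)
# or its column of blocks (as a destination): <= 2 * sqrt(P) partitions
parts = {edge_partition_2d(7, d, 9) for d in range(100)}
print(len(parts) <= 2 * 3)  # True
```

However many out-edges vertex 7 has, they all fall into one row of the 3x3 grid, which is why this strategy caps the replication factor even for super nodes.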

Graph operator of GraphX

Like Spark, GraphX's Graph class provides a rich set of graph operators, roughly as follows:

For the specific description and usage of each method, the official GraphX Programming Guide documents every function in detail, so they are not enumerated here. Instead, a few methods deserve special attention:

Caching a graph

Because a graph is made up of three RDDs, it occupies more memory, so its cache, unpersist and checkpoint calls require extra care. To honor the principle of maximal edge reuse, GraphX's default interface only provides the unpersistVertices method; to release the edges you must call g.edges.unpersist() yourself. This is slightly inconvenient for users, but it gives GraphX room for optimization.

Referring to the Pregel code in GraphX, the current best practice for iterating over a large graph is:

var g = ...
var prevG: Graph[VD, ED] = null
while (...) {
  prevG = g
  g = g. ... (...)
  g.cache()
  prevG.unpersistVertices(blocking = false)
  prevG.edges.unpersist(blocking = false)
}

Essentially, because graphs in GraphX are immutable, once g is transformed and assigned back to g, it is no longer the original g; it will be used in the next iteration, so it must be cached. In addition, prevG must first retain a reference to the previous graph, so that the old graph can be released completely and promptly once the new one is materialized. Otherwise, after several rounds of iteration on a large graph there will be a memory leak that quickly exhausts the job's memory.

mrTriplets: aggregation over adjacent edges

The full name of mrTriplets is mapReduceTriplets. It is the most central and powerful interface in GraphX, and Pregel is built on top of it, so optimizing it greatly affects the performance of GraphX as a whole.

The simplified definition of the mrTriplets operator is:

def mapReduceTriplets[A](
    map: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    reduce: (A, A) => A)
  : VertexRDD[A]

Its calculation process is as follows:

map: applied to each triplet, generating one or more messages addressed to either or both of the triplet's two vertices

reduce: applied per vertex, merging all the messages sent to that vertex

What mrTriplets finally returns is a VertexRDD[A] containing each vertex's aggregated message (of type A); vertices that received no message are not included in the returned VertexRDD.
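The contract above can be mimicked in plain Python. This is a sketch of the semantics, not the Spark API: map runs once per triplet and emits (vertexId, message) pairs, reduce merges messages per destination, and vertices receiving nothing are simply absent from the result.

```python
def map_reduce_triplets(vertices, edges, map_fn, reduce_fn):
    """vertices: id -> attr; edges: list of (src, dst, attr)."""
    result = {}
    for src, dst, attr in edges:
        triplet = (src, vertices[src], dst, vertices[dst], attr)
        for target, msg in map_fn(triplet):
            result[target] = (reduce_fn(result[target], msg)
                              if target in result else msg)
    return result

# example: in-degree of every vertex (send 1 to each edge's destination)
verts = {1: "a", 2: "b", 3: "c"}
edges = [(1, 2, None), (1, 3, None), (2, 3, None)]
in_deg = map_reduce_triplets(verts, edges,
                             map_fn=lambda t: [(t[2], 1)],
                             reduce_fn=lambda a, b: a + b)
print(in_deg)  # {2: 1, 3: 2}; vertex 1 got no message, so it is absent
```

The in-degree example also previews the Join Elimination optimization described below: the map function here touches neither vertex attribute, so in principle no vertex-edge join is needed at all.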

Recent releases of GraphX have applied the following optimizations to it, with a marked impact on the performance of Pregel and all the algorithm toolkits built on top:

Caching for Iterative mrTriplets & Incremental Updates for Iterative mrTriplets

In many graph analysis algorithms, different vertices converge at very different rates; toward the end of the iterations, only a few vertices are still being updated. For the vertices that did not change, the EdgeRDD need not refresh its locally cached copies of their values before the next mrTriplets pass, which greatly reduces communication overhead.

Indexing Active Edges

A vertex that was not updated need not resend messages to its neighbors in the next iteration. So when mrTriplets traverses the edges, any edge whose relevant endpoint was not updated in the previous round can be skipped outright, avoiding a great deal of useless computation and communication.

Join Elimination

A triplet is the triple formed by an edge and its two endpoint vertices. A map function over triplets often needs only one of the two vertex values. In PageRank, for instance, updating a vertex's value depends only on the value of its source vertex, not on the value of the destination vertex it points to. In that case the mrTriplets computation does not need a 3-way join of VertexRDD and EdgeRDD; a 2-way join suffices.

Together these optimizations bring GraphX's performance close to GraphLab's. A gap remains, but the integrated pipeline services and rich programming interfaces make up for the slight shortfall in raw performance.

An evolved Pregel computing model

The Pregel interface in GraphX does not strictly follow the Pregel model; it is an improved Pregel that borrows from GAS. Its definition is as follows:

def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A)
  : Graph[VD, ED]

The biggest difference between this mrTriplets-based Pregel and standard Pregel is that its second parameter list takes three function parameters rather than a messageList. Instead of iterating over messages on a single vertex, it aggregates the messages received by a vertex's multiple Ghost copies, sends the result to the master copy, and applies the vprog function to update the vertex value. Both the receiving and the sending of messages are parallelized automatically, so super nodes are no longer a worry.

Common code templates are as follows:

// update the vertex
def vprog(vid: Long, vert: Vertex, msg: Double): Vertex = {
  vert.score = msg + (1 - ALPHA) * vert.weight
  vert
}

// send messages
def sendMsg(edgeTriplet: EdgeTriplet[...]): Iterator[(Long, Double)] = {
  Iterator((edgeTriplet.dstId, ALPHA * edgeTriplet.srcAttr.score * edgeTriplet.attr.weight))
}

// merge messages
def mergeMsg(v1: Double, v2: Double): Double = {
  v1 + v2
}

It is clear why GraphX designed this model: it combines the strengths of Pregel and GAS, keeping the interface relatively simple while guaranteeing performance. It fits the vertex-cut storage mode and can handle large-scale computation over power-law natural graphs. It is also worth noting that the official Pregel version is deliberately minimal; extending a custom Pregel from it for complex business scenarios is common practice.
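The loop those three functions plug into can be simulated in plain Python. This is a hedged, single-process sketch of the evolved Pregel semantics (not Spark code): merge messages per destination, update vertices with vprog, regenerate messages from the triplets, repeat. The PageRank-flavored functions mirror the template above, with ALPHA assumed to be the usual 0.85 and every vertex weight taken as 1.0.

```python
def pregel(vertices, edges, initial_msg, max_iter, vprog, send_msg, merge_msg):
    msgs = {v: initial_msg for v in vertices}
    for _ in range(max_iter):
        # update every vertex that has a pending (merged) message
        for v, m in msgs.items():
            vertices[v] = vprog(v, vertices[v], m)
        # recompute messages from the triplets, merging per destination
        new_msgs = {}
        for src, dst, attr in edges:
            triplet = (src, vertices[src], dst, vertices[dst], attr)
            for target, m in send_msg(triplet):
                new_msgs[target] = (merge_msg(new_msgs[target], m)
                                    if target in new_msgs else m)
        if not new_msgs:
            break
        msgs = new_msgs
    return vertices

ALPHA = 0.85
verts = {0: 1.0, 1: 1.0}
edges = [(0, 1, 1.0), (1, 0, 1.0)]
ranks = pregel(verts, edges, initial_msg=0.0, max_iter=10,
               vprog=lambda vid, score, msg: msg + (1 - ALPHA) * 1.0,
               send_msg=lambda t: [(t[2], ALPHA * t[1] * t[4])],
               merge_msg=lambda a, b: a + b)
print(ranks)
```

Because the graph is symmetric, the two vertices end up with identical scores; in the real API the same three functions would run distributed, with mergeMsg applied across Ghost copies before vprog touches the master.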

Graph algorithm toolkit

GraphX also provides a set of graph algorithms to make analysis convenient for users. The latest version currently ships six classic algorithms, including PageRank, triangle counting, connected components and shortest paths. Their implementations aim for generality; to get the best performance, you can study them and modify or extend them to fit your business needs. Reading these implementations is also an excellent way to learn GraphX programming best practice, and anyone interested in going deeper into distributed graph algorithm development is encouraged to read them through.

GraphX in Taobao

1. Graph health-check platform

Basically, any relationship can be viewed and processed from a graph perspective, but how valuable is a given relationship graph? Is it healthy? Which scenarios suit it? Much of the time, operations and product staff judge and evaluate by feel. Refining and standardizing the metrics of various graphs, using data to guide product and operations planning in advance, and providing a basis for scientific decisions: that was the original intent and starting point for designing the graph health-check platform.

Starting from that point, and with the help of GraphX's rich interfaces and toolkits, we built a graph health-check platform for the wide range of graph-related business needs inside Taobao. It currently checks the following metrics:

Degree distribution

Degree distribution is the most basic yet a very important metric of a graph. The main purpose of checking it is to understand the number and scale of "super nodes" in the graph, along with the distribution curve of all node degrees. Super nodes significantly affect many propagation algorithms, whether as positive boosts or negative drags, so these figures need to be estimated in advance. With GraphX's most basic graph information interface, degrees: VertexRDD[Int] (including inDegrees and outDegrees), this metric is easy to compute and analyze statistically.
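The check itself is a small aggregation. This is a plain-Python sketch of what the degrees-based check computes (in GraphX it would come from graph.degrees / inDegrees / outDegrees); the super-node threshold here is an arbitrary illustrative value.

```python
from collections import Counter

def degree_stats(edges, super_node_threshold):
    """edges: directed (src, dst) pairs. Returns the degree distribution
    (degree -> vertex count) and the set of super nodes."""
    out_deg, in_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    total = out_deg + in_deg                 # total degree per vertex
    distribution = Counter(total.values())   # degree -> how many vertices
    super_nodes = {v for v, d in total.items() if d > super_node_threshold}
    return distribution, super_nodes

edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2)]
dist, supers = degree_stats(edges, super_node_threshold=3)
print(supers)  # {0}: vertex 0 has total degree 4
```

On a power-law graph the resulting distribution has a long tail, and the super-node set flagged here is exactly what the two-hop-neighbor check below filters out first.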

Number of two-hop neighbors

For most social relationships, the one-hop degree distribution alone is not enough; another important metric is the number of two-hop neighbors. In the Secret app, for example, the secrets of friends-of-friends spread further and carry richer information, so counting two-hop neighbors is a very important item in the graph health check. GraphX offers no interface for computing two-hop neighbors, so we designed and developed our own. The current method is:

In the first pass, every vertex sends a message carrying its own Id with a hop counter of 2 to its neighbors.

In the second pass, every vertex forwards the messages it received to its neighbors again, with the hop counter set to 1.

Finally, every vertex collects the Ids that arrived with a hop counter of 1, and grouping and aggregating them yields the two-hop neighbors of every vertex.

Note that before this computation, the degree distribution must be used to remove the super nodes from the graph and exclude them from the two-hop neighbor count. Otherwise, those super nodes would receive far too many messages after the first round of propagation and blow up; and if they participated, they would also distort the result for every vertex within one hop of them, preventing those vertices from getting a truly meaningful two-hop neighbor count. They must therefore be filtered out first.
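The two-round propagation above can be sketched in plain Python. This is an illustrative single-process version (super nodes are assumed already filtered out); one detail added here beyond the text is the final dedup step that drops a vertex's own Id and its direct neighbors from the counter-1 messages, since back-and-forth forwarding naturally delivers those too.

```python
def two_hop_neighbors(adj):
    """adj: vertex -> set of adjacent vertices (undirected graph)."""
    # round 1: every vertex sends (its Id, hop counter 2) to its neighbors
    inbox = {v: set() for v in adj}
    for v, ns in adj.items():
        for n in ns:
            inbox[n].add((v, 2))
    # round 2: forward the received Ids with the counter decremented to 1
    inbox2 = {v: set() for v in adj}
    for v, msgs in inbox.items():
        for origin, _ in msgs:
            for n in adj[v]:
                inbox2[n].add((origin, 1))
    # group the counter-1 Ids per vertex, excluding self and 1-hop neighbors
    return {v: {o for o, c in msgs if c == 1 and o != v and o not in adj[v]}
            for v, msgs in inbox2.items()}

adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
result = two_hop_neighbors(adj)
print(result)  # on the path 0-1-2-3: 0<->2 and 1<->3 are two hops apart
```

On the path graph each endpoint sees exactly the vertex two steps away, matching the intended definition.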

Connected graph

The purpose of detecting connected components is to find out how many connected parts a graph has and how many vertices each part contains. A big graph can thus be split into several small ones, fragmentary connected parts can be removed, and finer-grained operations can be carried out on the subgraphs. GraphX currently provides the ConnectedComponents and StronglyConnectedComponents algorithms, which quickly compute the corresponding components.

Connected components can evolve further into community detection, and one criterion for judging such an algorithm is its modularity Q value. GraphX does not yet include a function to compute Q, so we implemented one ourselves and will contribute it to the community later.
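The Q value mentioned here follows directly from its standard definition: Q is the sum over communities of (e_c / m - (d_c / 2m)^2), where m is the total edge count, e_c the number of edges internal to community c, and d_c the summed degree of c's vertices. A hedged plain-Python sketch of that formula (not the implementation submitted to the community):

```python
def modularity(edges, community):
    """edges: undirected (u, v) pairs; community: vertex -> community id."""
    m = len(edges)
    intra = {}   # community -> number of internal edges
    degree = {}  # community -> summed degree of its vertices
    for u, v in edges:
        degree[community[u]] = degree.get(community[u], 0) + 1
        degree[community[v]] = degree.get(community[v], 0) + 1
        if community[u] == community[v]:
            c = community[u]
            intra[c] = intra.get(c, 0) + 1
    return sum(intra.get(c, 0) / m - (degree[c] / (2 * m)) ** 2
               for c in degree)

# two triangles joined by a single bridge, split into natural communities
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
comm = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
q = modularity(edges, comm)
print(q > 0.3)  # a clearly modular split scores well above random
```

A random partition of the same graph would score near zero, which is what makes Q a usable quality criterion for community detection output.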

More metrics, such as Triangle Count and K-Core, are in progress, some built with existing GraphX functions and some developed from scratch. The health-check platform has already begun to take shape; through building and promoting it, graph-related products and businesses have gradually moved onto the data-driven path of "no data, no discussion; estimate effects with metrics". This has effectively improved communication efficiency and laid the groundwork for developing graph-related business scientifically and systematically.

2. Multi-graph merging tool

Building on the graph health-check platform, we can understand the characteristics of the various relationships. Different relationships have different strengths: some relationship graphs are better connected, others more social, so we often need to use relationship A to enrich relationship B. To this end we built a multi-graph merging tool on top of the health-check platform with GraphX. It provides something like a union of graphs, quickly merging two different relationship graphs into a new one.

Take extending the graph of relation A with the graph of relation B to produce the extended graph C as an example; the basic idea of the merge algorithm is:

If both vertices of an edge in graph B are already in graph A, add the edge to graph C (e.g., the BD edge).

If one vertex of an edge in graph B is in graph A and the other is not, add both the edge and the missing vertex (e.g., the CE edge and vertex E).

If neither vertex of an edge in graph B is in graph A, discard the edge and its vertices (e.g., the EF edge).

This can be done easily with graph operators such as GraphX's outerJoinVertices. When merging, we can also consider giving the edges from different graphs different weights, weighing the relative importance of the different relationships between vertices. The newly generated graph then goes through another round of health checks; by comparing the check metrics of the three graphs before and after, we can predict and judge the effect once the business goes live. If it falls short of expectations, another expansion scheme can be tried.
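The three merge rules can be stated directly in code. This is a plain-Python sketch of the rule set (in GraphX it would be expressed with outerJoinVertices and edge filters); the vertex and edge names match the examples in the rules above.

```python
def merge_graphs(a_vertices, a_edges, b_edges):
    """Extend graph A with the edges of graph B per the three rules."""
    c_vertices, c_edges = set(a_vertices), set(a_edges)
    for u, v in b_edges:
        in_a = (u in a_vertices, v in a_vertices)
        if all(in_a):            # both endpoints already in A: add the edge
            c_edges.add((u, v))
        elif any(in_a):          # one endpoint in A: add edge + new vertex
            c_edges.add((u, v))
            c_vertices.update((u, v))
        # neither endpoint in A: drop the edge and its vertices
    return c_vertices, c_edges

a_verts = {"B", "C", "D"}
a_edges = {("B", "C")}
b_edges = {("B", "D"), ("C", "E"), ("E", "F")}
c_verts, c_edges = merge_graphs(a_verts, a_edges, b_edges)
print(sorted(c_verts))  # ['B', 'C', 'D', 'E']: E joined, F was dropped
```

The BD edge is adopted outright, CE brings the new vertex E along, and EF never makes it into C, exactly as the three rules prescribe.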

3. Energy propagation model

Energy propagation over weighted networks is a classic graph model that can be used to predict user reputation. The model's idea is that birds of a feather flock together: users who frequently trade with high-reputation users are naturally more credible, and users who frequently deal with low-reputation users naturally less so. The model itself is not complex, but Taobao has hundreds of millions of user vertices and billions of relationship edges; propagating energy over such a huge graph while finely tuning edge weights is a severe test of a graph computing framework's performance and functionality. With GraphX we struck a balance between the two and implemented the model successfully.

The process, shown in figure 4, is as follows. First, build a huge graph initGraph with users as vertices and trading relationships as edges. Select seed users and give them equal initial positive and negative energy values (TrustRank and BadRank), then run two rounds of random walks: one in which good seeds spread positive energy (tr), and one in which bad seeds spread negative energy (br). Subtracting the negative energy from the positive yields finalRank, which determines whether a user is good or bad. With the initial propagation strength of every edge set to 0.85, the AUC was very low, so each edge needs a weight combining multiple features (number of transactions, amount, ...), each feature with its own independent weight and offset. Using the partialDerivativeAUC method, we compute the AUC on the training set and then its partial derivatives, obtain the independent weight and offset of each relationship dimension, generate a new weight adjuster (WeightAdjustor), update the weights on all edges of the graph, and start a new round of iteration; the computation terminates once the AUC stabilizes.

After three rounds of large iterations over nearly the full data set, each round running 2 to 6 Pregel invocations and each Pregel about 30 small iterations, the final AUC improved from 0.6 to 0.9, achieving good accuracy in predicting user quality. Training takes about 6 hours, exceeding the business side's expectations in both performance and accuracy.
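At toy scale, the tr/br propagation and the finalRank subtraction can be sketched in plain Python. This is a hedged illustration of the idea only: the seed names, edge weights and round count are made up, and the AUC-driven weight-learning step is omitted entirely.

```python
def propagate(seed_energy, out_edges, damping, rounds):
    """One energy type (tr or br): seeds inject energy; each round pushes
    damping * energy along weighted out-edges and accumulates at targets."""
    rank = dict(seed_energy)
    frontier = dict(seed_energy)
    for _ in range(rounds):
        nxt = {}
        for u, e in frontier.items():
            for v, w in out_edges.get(u, []):
                nxt[v] = nxt.get(v, 0.0) + damping * e * w
        for v, e in nxt.items():
            rank[v] = rank.get(v, 0.0) + e
        frontier = nxt
    return rank

# hypothetical tiny trade graph: "u" trades heavily with a good seed,
# lightly with a bad one; "x" only trades with "u"
edges = {"good": [("u", 1.0)], "bad": [("u", 0.5)], "u": [("x", 1.0)]}
tr = propagate({"good": 1.0}, edges, damping=0.85, rounds=2)
br = propagate({"bad": 1.0}, edges, damping=0.85, rounds=2)
final = {v: tr.get(v, 0.0) - br.get(v, 0.0) for v in set(tr) | set(br)}
print(final["u"] > 0)  # positive energy dominates: "u" looks reputable
```

User "u" ends up with a positive finalRank because its stronger tie is to the good seed, which is precisely the "birds of a feather" intuition; the production model differs mainly in scale and in learning the edge weights from the AUC.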

The prospect of future graph computing

After more than half a year of experimentation, we now have a clear sense of the scale and performance GraphX can handle for graph computing. Graph models we wanted to build before, but could not for lack of computing power, are no longer a problem, and we will implement more and more graph models on GraphX.

Applied to community detection, user influence, energy propagation and label propagation over user networks, these models can improve user stickiness and activity; applied to label inference, audience segmentation, age prediction and transaction-sequence modeling in the recommendation domain, they can improve the richness and accuracy of recommendations. The world of complex networks and graph computing is vast, with much unknown territory waiting to be explored and put into practice; with the help of Spark GraphX, we can take on even bigger challenges in the future.

That's all for "how to use Spark GraphX". Thank you for reading.
