How to use GraphX based on Spark

2025-05-02 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)05/31 Report--

This article introduces how to use GraphX on Spark. It covers the property graph model, the main operators of the Graph class, caching, the Pregel API, graph construction, and the VertexRDD and EdgeRDD types.

1. Property Graph: a directed multigraph in which a user-defined object is attached to each vertex and edge. Being a multigraph, it allows several parallel edges between the same two vertices. Each vertex has a unique 64-bit identifier (VertexId), and GraphX imposes no ordering constraint on VertexId values. Each edge is identified by the VertexIds of its source and destination vertices.

Graph has two type parameters, VD and ED, which are the types of the objects attached to each vertex and each edge, respectively. When VD and ED are primitive data types, GraphX stores them in specialized arrays, which reduces the memory footprint.

A Graph, like an RDD (Resilient Distributed Dataset, the basic data type of Spark), is immutable after creation, distributed across the cluster, and fault tolerant. Changing the structure or the values of a graph produces a new Graph object, and the new Graph shares most of its data structures with the previous one. Graph is partitioned across machines using vertex-cut partitioning. If a machine holding a partition fails, that partition is recreated on another machine.

Logically, Graph contains VertexRDD and EdgeRDD, that is:

    class Graph[VD, ED] {
      val vertices: VertexRDD[VD]
      val edges: EdgeRDD[ED, VD]
    }

Here VertexRDD[VD] and EdgeRDD[ED, VD] extend RDD[(VertexId, VD)] and RDD[Edge[ED]] respectively. They add functionality for graph computation and apply internal optimizations.
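As a minimal sketch of the property graph described above, the following builds a small graph from a vertex RDD and an edge RDD. The names, the relationship labels, and the local master setting are hypothetical example values, and the code assumes a Spark release with the GraphX module on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object PropertyGraphSketch {
  // Hypothetical example data: VD = String (a name), ED = String (a relationship).
  def build(sc: SparkContext): Graph[String, String] = {
    val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges: RDD[Edge[String]] = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"),
      Edge(1L, 2L, "likes"), // parallel edge between the same two vertices
      Edge(2L, 3L, "follows")))
    Graph(vertices, edges)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PropertyGraphSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = build(sc)
    // Each triplet joins an edge with the attributes of its two endpoints.
    graph.triplets.collect().foreach { t =>
      println(s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
    }
    sc.stop()
  }
}
```

The two edges from vertex 1 to vertex 2 illustrate that parallel edges are allowed.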

2. Member variables of the Graph class

    class Graph[VD, ED] {
      // Basic information about the graph: number of edges, number of vertices,
      // in-degree, out-degree, and degree
      val numEdges: Long
      val numVertices: Long
      val inDegrees: VertexRDD[Int]
      val outDegrees: VertexRDD[Int]
      val degrees: VertexRDD[Int]
      // The graph's vertex RDD, edge RDD, and triplet RDD
      val vertices: VertexRDD[VD]
      val edges: EdgeRDD[ED, VD]
      val triplets: RDD[EdgeTriplet[VD, ED]]
    }
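The members listed above can be read directly from any graph. The sketch below assumes a hypothetical three-vertex graph built with Graph.fromEdgeTuples and collects the degree RDDs into local maps, which is only reasonable for tiny graphs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphInfoSketch {
  // Hypothetical toy graph with edges 1->2, 1->3, 2->3; every vertex gets the attribute 0.
  def build(sc: SparkContext): Graph[Int, Int] = {
    val rawEdges = sc.parallelize(Seq((1L, 2L), (1L, 3L), (2L, 3L)))
    Graph.fromEdgeTuples(rawEdges, 0)
  }

  // Collect a degree RDD into a local map (fine for tiny graphs only).
  def toMap(degrees: RDD[(VertexId, Int)]): Map[VertexId, Int] =
    degrees.collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("GraphInfoSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = build(sc)
    println(s"vertices=${graph.numVertices}, edges=${graph.numEdges}")
    println(s"inDegrees=${toMap(graph.inDegrees)}")
    println(s"outDegrees=${toMap(graph.outDegrees)}")
    println(s"degrees=${toMap(graph.degrees)}")
    sc.stop()
  }
}
```

Note that a vertex with no incoming edges does not appear in inDegrees at all, rather than appearing with the value 0.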

3. Property operators

    class Graph[VD, ED] {
      def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
      def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
      def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
    }

Each of these operators returns a new Graph whose vertex or edge properties have been transformed by the user-defined map function. The structure of the graph is not affected.
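A short sketch of the three map operators, using a hypothetical graph with Int vertex attributes and Double edge attributes. The helper names are illustrative only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object PropertyOperatorsSketch {
  // Hypothetical toy graph: vertex attribute is an Int, edge attribute is a Double.
  def build(sc: SparkContext): Graph[Int, Double] = {
    val vertices = sc.parallelize(Seq((1L, 10), (2L, 20), (3L, 30)))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0)))
    Graph(vertices, edges)
  }

  // mapVertices may change the vertex attribute type (Int -> String here).
  def labelled(graph: Graph[Int, Double]): Graph[String, Double] =
    graph.mapVertices((id, value) => s"v$id=$value")

  // mapEdges sees only the edge itself.
  def doubled(graph: Graph[Int, Double]): Graph[Int, Double] =
    graph.mapEdges(e => e.attr * 2)

  // mapTriplets also sees the attributes of both endpoints.
  def endpointSums(graph: Graph[Int, Double]): Graph[Int, Int] =
    graph.mapTriplets(t => t.srcAttr + t.dstAttr)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PropertyOperatorsSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = build(sc)
    labelled(graph).vertices.collect().foreach(println)
    doubled(graph).edges.collect().foreach(println)
    endpointSums(graph).edges.collect().foreach(println)
    sc.stop()
  }
}
```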

4. Structural operators

    class Graph[VD, ED] {
      def reverse: Graph[VD, ED]
      def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                   vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
      def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
      def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
    }

The reverse operation returns a new graph with the direction of every edge reversed. Since this operation does not change the properties of vertices or edges, it requires no data movement.

The subgraph operation returns a graph containing only the vertices that satisfy vpred and the edges that satisfy epred and whose two endpoints both satisfy vpred.

The mask operation returns the subgraph of the current graph whose vertices and edges also appear in the other graph. The groupEdges operation merges parallel (duplicate) edges between the same pair of vertices.
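The four structural operators can be sketched on a small hypothetical graph whose vertex attribute is an age and which contains one duplicated edge. The age threshold and the choice of PartitionStrategy.RandomVertexCut are assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}

object StructuralOperatorsSketch {
  // Hypothetical toy graph: vertex attribute is an age, edge attribute is a count.
  def build(sc: SparkContext): Graph[Int, Int] = {
    val vertices = sc.parallelize(Seq((1L, 30), (2L, 15), (3L, 40)))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(1L, 3L, 1), Edge(1L, 3L, 1), Edge(3L, 1L, 1)))
    Graph(vertices, edges)
  }

  // Keep only vertices with age >= 18; edges touching a removed vertex are dropped too.
  def adults(graph: Graph[Int, Int]): Graph[Int, Int] =
    graph.subgraph(vpred = (id, age) => age >= 18)

  // Merge parallel edges by summing their counts; partitionBy must come first.
  def merged(graph: Graph[Int, Int]): Graph[Int, Int] =
    graph.partitionBy(PartitionStrategy.RandomVertexCut).groupEdges(_ + _)

  // Sorted local copy of the edge list, for inspection on tiny graphs.
  def edgeList(graph: Graph[Int, Int]): Seq[(VertexId, VertexId, Int)] =
    graph.edges.map(e => (e.srcId, e.dstId, e.attr)).collect().toSeq.sorted

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("StructuralOperatorsSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = build(sc)
    println(s"reverse:  ${edgeList(graph.reverse)}")
    println(s"subgraph: ${edgeList(adults(graph))}")
    println(s"mask:     ${edgeList(graph.mask(adults(graph)))}")
    println(s"grouped:  ${edgeList(merged(graph))}")
    sc.stop()
  }
}
```

Edges in opposite directions, such as 1->3 and 3->1, are not merged by groupEdges because they have different source and destination vertices.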

5. Join operators

    class Graph[VD, ED] {
      def joinVertices[U](table: RDD[(VertexId, U)])
                         (map: (VertexId, VD, U) => VD): Graph[VD, ED]
      def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])
                                   (map: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]
    }

The joinVertices operation joins the vertices with the input RDD and applies the user-defined map function to each matched vertex. Vertices that have no matching entry in the RDD keep their original value.

outerJoinVertices is similar to joinVertices, except that the user-defined map function is applied to all vertices (the joined value is passed as an Option) and can change the vertex property type.

Both operators use the curried form f(a)(b). This is equivalent to f(a, b), except that type inference for b can depend on a.
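The difference between the two join operators is easiest to see side by side. In this sketch the vertex names and the roles table are hypothetical, and outerJoinVertices is joined against the graph's own outDegrees.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object JoinOperatorsSketch {
  // Hypothetical toy graph: vertex attribute is a name.
  def build(sc: SparkContext): Graph[String, Int] = {
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    Graph(vertices, edges)
  }

  // joinVertices: only matched vertices are rewritten, and the type stays String.
  def withRoles(graph: Graph[String, Int], roles: RDD[(VertexId, String)]): Graph[String, Int] =
    graph.joinVertices(roles)((id, name, role) => s"$name:$role")

  // outerJoinVertices: every vertex is visited and the type changes to (String, Int).
  def withOutDegree(graph: Graph[String, Int]): Graph[(String, Int), Int] =
    graph.outerJoinVertices(graph.outDegrees) { (id, name, degOpt) =>
      (name, degOpt.getOrElse(0)) // vertices without outgoing edges get 0
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("JoinOperatorsSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = build(sc)
    val roles = sc.parallelize(Seq((1L, "admin")))
    withRoles(graph, roles).vertices.collect().foreach(println)
    withOutDegree(graph).vertices.collect().foreach(println)
    sc.stop()
  }
}
```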

6. Neighborhood aggregation

In GraphX, the deeply optimized core aggregation operation is mapReduceTriplets

    class Graph[VD, ED] {
      def mapReduceTriplets[A](
          map: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
          reduce: (A, A) => A): VertexRDD[A]
    }

mapReduceTriplets takes a user-defined map function, applies it to each triplet of the Graph, and emits messages addressed to either vertex of the triplet. To allow optimized pre-aggregation, only messages to the source or destination vertex of the triplet are currently supported. The user-defined reduce function then combines the messages sent to each vertex. The result is a VertexRDD[A]; vertices that received no message are not included in it.

mapReduceTriplets also takes an optional parameter, activeSetOpt, which restricts the map phase to edges adjacent to the vertices in the given vertex set.
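The mapReduceTriplets signature above comes from early GraphX releases. In Spark 1.2 and later the same map-then-reduce pattern is written with aggregateMessages, which is what the following sketch uses. It assumes a hypothetical follower graph where the vertex attribute is an age, and counts each vertex's older followers together with the sum of their ages.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

object NeighborhoodAggregationSketch {
  // Hypothetical follower graph: vertex attribute is an age, edge src follows edge dst.
  def build(sc: SparkContext): Graph[Int, Int] = {
    val vertices = sc.parallelize(Seq((1L, 50), (2L, 30), (3L, 40), (4L, 20)))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(4L, 2L, 1), Edge(1L, 4L, 1)))
    Graph(vertices, edges)
  }

  // For each vertex: (number of older followers, sum of their ages).
  def olderFollowers(graph: Graph[Int, Int]): VertexRDD[(Int, Int)] =
    graph.aggregateMessages[(Int, Int)](
      // "map" step: runs once per triplet and may send a message to the destination
      ctx => {
        if (ctx.srcAttr > ctx.dstAttr) ctx.sendToDst((1, ctx.srcAttr))
      },
      // "reduce" step: combines the messages that arrive at the same vertex
      (a, b) => (a._1 + b._1, a._2 + b._2))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("NeighborhoodAggregationSketch")
    val sc = SparkContext.getOrCreate(conf)
    olderFollowers(build(sc)).collect().foreach(println)
    sc.stop()
  }
}
```

As with mapReduceTriplets, vertices that receive no message (here vertices 1 and 3) are absent from the returned VertexRDD.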

7. In Spark, RDDs are not kept in memory by default. To avoid recomputation when a graph is used more than once, you must cache it explicitly with Graph.cache(). A cached RDD stays in memory until memory pressure forces Spark to evict it in LRU (least recently used) order. In iterative computation, however, the intermediate data produced by earlier iterations should be uncached explicitly so that it does not fill the cache unnecessarily. For iterative graph computation it is therefore recommended to use the Pregel API, which automatically unpersists intermediate results that are no longer needed.
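For an iterative job written by hand, the cache-then-unpersist pattern could look like the sketch below. The toy graph and the "add one per round" update are placeholders, and this is only one possible arrangement, not the exact bookkeeping that Pregel performs internally.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.Graph

object CachingSketch {
  // Hypothetical toy graph; every vertex starts with the attribute 0.
  def build(sc: SparkContext): Graph[Int, Int] =
    Graph.fromEdgeTuples(sc.parallelize(Seq((1L, 2L), (2L, 3L))), 0)

  // Placeholder iteration: add 1 to every vertex attribute in each round.
  def iterate(graph: Graph[Int, Int], rounds: Int): Graph[Int, Int] = {
    var g = graph.cache()
    for (_ <- 1 to rounds) {
      val prev = g
      g = prev.mapVertices((_, value) => value + 1).cache()
      // Materialize the new graph while the previous one is still cached...
      g.vertices.count()
      g.edges.count()
      // ...then release the previous iteration's data.
      prev.unpersist()
    }
    g
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("CachingSketch")
    val sc = SparkContext.getOrCreate(conf)
    iterate(build(sc), 3).vertices.collect().foreach(println)
    sc.stop()
  }
}
```

Unpersisting only drops cached blocks. If a later computation still needs that data, Spark recomputes it from lineage, so the result stays correct even if the cost rises.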

8. GraphX Pregel API

A graph is an inherently recursive data structure: the properties of a vertex depend on the properties of its neighbors, which in turn depend on the properties of their own neighbors. Many important graph algorithms therefore recompute the properties of each vertex iteratively until convergence. GraphX provides a Pregel-like operator that combines the abstractions of Google Pregel and GraphLab.

    class GraphOps[VD, ED] {
      def pregel[A](
          initialMsg: A,                                 // initial message
          maxIter: Int = Int.MaxValue,                   // maximum number of iterations
          activeDir: EdgeDirection = EdgeDirection.Out)  // message passing direction
         (vprog: (VertexId, VD, A) => VD,
          sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
          mergeMsg: (A, A) => A): Graph[VD, ED] = {
        var g = mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
        var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
        var activeMessages = messages.count()
        var i = 0
        while (activeMessages > 0 && i < maxIter) {
          val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
          g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) =>
            newOpt.getOrElse(old)
          }.cache()
          messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
          activeMessages = messages.count()
          i += 1
        }
        g
      }
    }
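As a usage sketch of the Pregel operator, the following computes single-source shortest paths, a common textbook example. The weighted edge list and the source vertex are hypothetical, and the code assumes the pregel method available on graphs in current Spark releases.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

object PregelShortestPathsSketch {
  // Hypothetical weighted graph: edge attribute is a distance.
  def build(sc: SparkContext): Graph[Int, Double] = {
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0)))
    Graph.fromEdges(edges, 0)
  }

  def shortestPaths(graph: Graph[Int, Double], sourceId: VertexId): Graph[Double, Double] = {
    // Distance 0 at the source, infinity everywhere else.
    val initial = graph.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)
    initial.pregel(Double.PositiveInfinity)(
      // vprog: keep the smaller of the current distance and the incoming one
      (id, dist, newDist) => math.min(dist, newDist),
      // sendMsg: offer a shorter path to the destination if one exists
      triplet => {
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        } else {
          Iterator.empty
        }
      },
      // mergeMsg: of several offers, keep the shortest
      (a, b) => math.min(a, b))
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("PregelShortestPathsSketch")
    val sc = SparkContext.getOrCreate(conf)
    shortestPaths(build(sc), 1L).vertices.collect().foreach(println)
    sc.stop()
  }
}
```

The loop ends on its own once no vertex receives a shorter offer, which corresponds to the activeMessages > 0 condition in the implementation above.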

9. Create Graph

GraphX provides several ways to create a graph, either from vertex and edge RDDs or from a file on disk. By default, none of the graph builders repartitions the edges of the graph; edges are left in their original partitions. However, Graph.groupEdges requires the graph to be repartitioned, because it assumes that identical edges are located in the same partition. You therefore need to call Graph.partitionBy before calling groupEdges.

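The GraphLoader.edgeListFile operation loads a graph from an edge list on disk. Each line holds a source vertex ID and a destination vertex ID separated by whitespace, and lines starting with # are skipped as comments. All vertex and edge attributes default to 1.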
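The loading and repartitioning steps above can be sketched together. The temporary file, its contents, and the choice of PartitionStrategy.EdgePartition2D are assumptions made for this example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Graph, GraphLoader, PartitionStrategy}

object EdgeListLoaderSketch {
  // Write a hypothetical edge list with one comment line and one duplicated edge.
  def writeSample(): Path = {
    val path = Files.createTempFile("edges", ".txt")
    val content = "# source destination\n1 2\n1 2\n2 3\n"
    Files.write(path, content.getBytes(StandardCharsets.UTF_8))
    path
  }

  // Vertex and edge attributes of the loaded graph are all 1.
  def load(sc: SparkContext, path: Path): Graph[Int, Int] =
    GraphLoader.edgeListFile(sc, path.toUri.toString)

  // Repartition first so identical edges share a partition, then merge them.
  def dedup(graph: Graph[Int, Int]): Graph[Int, Int] =
    graph.partitionBy(PartitionStrategy.EdgePartition2D).groupEdges(_ + _)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("EdgeListLoaderSketch")
    val sc = SparkContext.getOrCreate(conf)
    val graph = load(sc, writeSample())
    println(s"edges before merge: ${graph.edges.count()}")
    dedup(graph).edges.collect().foreach(println)
    sc.stop()
  }
}
```

After the merge, the attribute of edge 1->2 is the sum of the two duplicates, so it records how many times that edge appeared in the file.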

10. VertexRDD and EdgeRDD

GraphX exposes the vertices and edges of a Graph as VertexRDD and EdgeRDD. Because GraphX stores vertices and edges in optimized data structures, these types also provide some additional functionality. VertexRDD[A] extends RDD[(VertexId, A)] and adds the constraint that each VertexId appears only once; the vertex attributes A are stored in a hash map. EdgeRDD extends RDD[Edge[ED]] and organizes the edges in blocks partitioned according to a PartitionStrategy. Within each block, the attributes of the edges and the adjacency structure are stored separately, so the structure can be reused when only attribute values change.
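A brief sketch of the extra VertexRDD functionality, following the pattern of the official GraphX guide: aggregateUsingIndex reuses the index of an existing VertexRDD to reduce an ordinary RDD that contains duplicate IDs, and innerJoin then joins two VertexRDDs that share that index. The ID range and values are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

object VertexRDDSketch {
  // A VertexRDD with IDs 0..4, each carrying the value 1.
  def setA(sc: SparkContext): VertexRDD[Int] =
    VertexRDD(sc.parallelize(0L until 5L).map(id => (id, 1)))

  // A plain RDD in which every ID occurs twice (values 1.0 and 2.0).
  def rddB(sc: SparkContext): RDD[(VertexId, Double)] =
    sc.parallelize(0L until 5L).flatMap(id => List((id, 1.0), (id, 2.0)))

  def combined(sc: SparkContext): VertexRDD[Double] = {
    val a = setA(sc)
    // Reduce the duplicates of rddB using the index of a: each ID ends up with 3.0.
    val b: VertexRDD[Double] = a.aggregateUsingIndex(rddB(sc), _ + _)
    // Both sides share the same index, so the join does not need to rebuild it.
    a.innerJoin(b)((id, x, y) => x + y)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("VertexRDDSketch")
    val sc = SparkContext.getOrCreate(conf)
    combined(sc).collect().foreach(println)
    sc.stop()
  }
}
```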

This concludes the introduction to using GraphX on Spark. Theory is best learned together with practice, so try these operations out for yourself.
