How to Get Started with Spark GraphX in Big Data Development

2025-02-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how to get started with Spark GraphX in big data development. The editor thinks it is very practical and shares it here; I hope you gain something after reading it. Let's take a look.

1. Introduction to Spark GraphX

GraphX is the Spark component for representing graphs and performing parallel graph computation. GraphX extends the RDD abstraction with a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (such as mapVertices, mapEdges, and subgraph) as well as an optimized variant of the Pregel API. It also includes a growing collection of graph algorithms and builders that simplify graph analytics tasks. GraphX optimizes how vertex and edge information is stored, so the framework performs far better than a naive RDD implementation, approaching or matching dedicated graph computing platforms such as GraphLab. GraphX's greatest contribution is providing a one-stack data solution on top of Spark, making it easy and efficient to run a complete pipeline of graph computation.
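As a quick illustration of the operators named above, here is a minimal sketch of my own (not from the original article), assuming a local Spark build with the spark-graphx dependency on the classpath; the graph data here is made up for the example:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object OperatorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("operator-sketch").setMaster("local[*]"))
    sc.setLogLevel("warn")

    // A tiny property graph: string vertex attributes, integer edge attributes.
    val graph: Graph[String, Int] = Graph(
      sc.makeRDD(Seq((1L, "a"), (2L, "b"), (3L, "c"))),
      sc.makeRDD(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20))))

    // mapVertices: transform each vertex attribute, keeping the structure.
    val upper = graph.mapVertices((_, attr) => attr.toUpperCase)
    // mapEdges: transform each edge attribute, keeping the structure.
    val doubled = graph.mapEdges(e => e.attr * 2)
    // subgraph: keep only the edges satisfying a predicate.
    val heavy = graph.subgraph(epred = t => t.attr > 15)

    println(upper.vertices.collect().mkString(", "))
    println(doubled.edges.collect().mkString(", "))
    println(heavy.edges.count()) // only the edge 2 -> 3 survives
    sc.stop()
  }
}
```

Note that each operator returns a new Graph; GraphX reuses the unchanged structure internally, which is what makes these transformations cheap.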

The graph computation model:

Basic graph computation follows the BSP (Bulk Synchronous Parallel) model, which divides computation into a series of superstep iterations. Viewed vertically it is serial; viewed horizontally it is parallel. A barrier, the global synchronization point, is placed between every two supersteps to ensure that all parallel computation has completed before the next superstep starts.

Each superstep consists of three parts:

- Compute: each processor performs local computation using its local data and the messages from the previous superstep.
- Message passing: after finishing its computation, each processor passes messages to the other processors associated with it.
- Global synchronization: a barrier ensures that all computation and message delivery have completed before the next superstep begins.
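The three-part superstep loop can be sketched on a single machine in plain Scala. This is my own toy illustration (not from the article, and not how GraphX schedules work internally): it computes shortest distances from vertex 1 over a small weighted graph, with each pass of the while loop playing the role of one superstep and the loop boundary acting as the barrier.

```scala
object BspSketch {
  // Directed edges (src, dst, weight).
  val edges = Seq((1, 2, 1800), (2, 3, 800), (3, 1, 1400))

  def run(): Map[Int, Int] = {
    // Vertex state: best-known distance from vertex 1.
    var state = Map(1 -> 0, 2 -> Int.MaxValue, 3 -> Int.MaxValue)
    // The source vertex starts with one message to itself.
    var messages: Map[Int, Int] = Map(1 -> 0)

    while (messages.nonEmpty) {
      // 1) Compute: each vertex merges incoming messages into local state.
      state = state.map { case (v, d) =>
        v -> math.min(d, messages.getOrElse(v, Int.MaxValue))
      }
      // 2) Message passing: vertices that can improve a neighbour send it an offer.
      messages = edges
        .flatMap { case (src, dst, w) =>
          val d = state(src)
          if (d != Int.MaxValue && d + w < state(dst)) Some(dst -> (d + w)) else None
        }
        .groupBy(_._1)
        .map { case (v, ms) => v -> ms.map(_._2).min }
      // 3) Barrier: the end of the loop body is the global synchronization point;
      //    no vertex starts the next superstep until all messages are delivered.
    }
    state
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Running it yields the distances Map(1 -> 0, 2 -> 1800, 3 -> 2600). GraphX's Pregel API follows the same compute / message / barrier rhythm, but distributed across partitions.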

2. Let's look at an example

Graph data

## Vertex data
1,"SFO"
2,"ORD"
3,"DFW"

## Edge data
1,2,1800
2,3,800
3,1,1400

Compute: all vertices, all edges, all triplets, the number of vertices, the number of edges, the edges whose distance is greater than 1000, and all edges sorted by distance in descending order.

Code implementation

package com.hoult.Streaming.work

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object GraphDemo {
  def main(args: Array[String]): Unit = {
    // Initialization
    val conf = new SparkConf()
      .setAppName(this.getClass.getCanonicalName.init)
      .setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.setLogLevel("warn")

    // Initialize the data
    val vertexArray: Array[(Long, String)] =
      Array((1L, "SFO"), (2L, "ORD"), (3L, "DFW"))
    val edgeArray: Array[Edge[Int]] =
      Array(Edge(1L, 2L, 1800), Edge(2L, 3L, 800), Edge(3L, 1L, 1400))

    // Build the vertexRDD and edgeRDD
    val vertexRDD: RDD[(VertexId, String)] = sc.makeRDD(vertexArray)
    val edgeRDD: RDD[Edge[Int]] = sc.makeRDD(edgeArray)

    // Construct the graph
    val graph: Graph[String, Int] = Graph(vertexRDD, edgeRDD)

    // All vertices
    graph.vertices.foreach(println)
    // All edges
    graph.edges.foreach(println)
    // All triplets
    graph.triplets.foreach(println)
    // Number of vertices
    val vertexCnt = graph.vertices.count()
    println(s"number of vertices: $vertexCnt")
    // Number of edges
    val edgeCnt = graph.edges.count()
    println(s"number of edges: $edgeCnt")
    // Edges with distance greater than 1000
    graph.edges.filter(_.attr > 1000).foreach(println)
    // All edges sorted by distance (descending)
    graph.edges.sortBy(-_.attr).collect().foreach(println)
  }
}

Output result

3. Some practical notes on graphs

The example above is demo-level; a real production workload will be far more complex, and generally GraphX is only used in certain scenarios. When doing iterative computation, pay attention to caching: RDDs are not kept in memory by default, so cache them explicitly whenever they will be reused. In iterative computation you may also need to uncache data to get the best performance. By default, cached RDDs and graphs stay in memory until memory pressure forces them out in LRU (least recently used) order. In iterative computation, the intermediate results of earlier iterations fill up memory; although they are eventually evicted, unneeded data sitting in memory slows down garbage collection. It is therefore more efficient to uncache intermediate results as soon as they are no longer needed. In practice this means materializing (caching) the graph or RDD produced in each iteration, uncaching all the others, and using only the materialized datasets in subsequent iterations. However, because a graph is made up of multiple RDDs, it is difficult to unpersist them all correctly. For iterative computation, the Pregel API is recommended, since it correctly unpersists intermediate results.
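The cache-then-unpersist pattern described above can be sketched as follows. This is an illustrative sketch of my own (not from the article), assuming a local Spark build; the toy graph and iteration count are made up:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-sketch").setMaster("local[*]"))
    sc.setLogLevel("warn")

    val initial: Graph[Double, Int] = Graph(
      sc.makeRDD(Seq((1L, 0.0), (2L, 0.0), (3L, 0.0))),
      sc.makeRDD(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1))))

    // Cache the graph reused across iterations.
    var current = initial.cache()
    for (_ <- 1 to 3) {
      val next = current.mapVertices((_, v) => v + 1.0).cache()
      // Force materialization of the new graph before dropping the old one.
      next.vertices.count()
      // Release the previous iteration's RDDs so they don't pile up in memory.
      current.unpersist(blocking = false)
      current = next
    }
    current.vertices.foreach(println)
    sc.stop()
  }
}
```

The awkward part is exactly what the paragraph warns about: each Graph is backed by several RDDs, so getting every unpersist call right by hand is error-prone, which is why Pregel is preferred for iterative jobs.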

The above is how to get started with Spark GraphX in big data development. The editor believes these are knowledge points you may see or use in daily work, and hopes you can learn more from this article. For more details, please follow the industry information channel.
