This article introduces the core structure of Apache Spark 2.0 and walks through it with practical examples. The techniques shown are simple, fast, and practical; I hope this article helps you answer the question "what is the core structure of Apache Spark 2.0?".
## 4. DataFrames, Datasets and Spark SQL
In step 3, you learned about resilient distributed datasets (RDDs). They form Spark's core data abstraction and are the foundation for all other, higher-level data abstractions and APIs, including DataFrames and Datasets.
In Spark 2.0, DataFrames and Datasets, built on top of RDDs, form the core high-level, structured distributed data abstractions. A DataFrame in Spark is a distributed collection of named data columns; it lets you impose a schema on your data, describe processing operations, and issue queries. Datasets go a step further and provide strict compile-time type safety, so certain classes of errors are caught at compile time rather than at run time.
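A minimal sketch of that difference, assuming a hypothetical `Person` case class and sample data (names and values here are illustrative, not from the original post). The typed Dataset catches a misspelled field at compile time, while the untyped DataFrame only fails once the query is analyzed at run time:

```scala
import org.apache.spark.sql.SparkSession

// hypothetical schema, reused in the examples below
case class Person(fname: String, lname: String, age: Int, weight: Double)

val spark = SparkSession.builder.appName("structured-apis").getOrCreate()
import spark.implicits._

val peopleDS = Seq(Person("Jane", "Doe", 58, 61.5)).toDS()  // typed: Dataset[Person]
val peopleDF = peopleDS.toDF()                              // untyped: Dataset[Row]

peopleDS.filter(p => p.age > 55)     // field access checked by the Scala compiler
// peopleDS.filter(p => p.aeg > 55)  // typo: rejected at compile time
// peopleDF.filter("aeg > 55")       // same typo: compiles, but fails at run time
```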
With structure and data types, Spark can understand the operations you describe: which columns of which types, or which fields with specific names, your query will access, and which specific operations it will perform on them. Spark then optimizes your code through Spark 2.0's Catalyst optimizer and generates efficient bytecode through Project Tungsten.
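You can watch Catalyst at work by asking Spark to print the plans for a query. A quick sketch, reusing the hypothetical `peopleDF` from above:

```scala
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan.
// Catalyst pushes the filter down and prunes unused columns before Tungsten
// generates bytecode for the chosen physical operators.
peopleDF.where("age > 55").select("fname", "age").explain(true)
```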
DataFrames and Datasets provide APIs in several high-level programming languages, making your code easier to read, and they support higher-order functions such as filter, sum, count, avg, min, and max. Whether you express your computation in Spark SQL or in Python, Java, Scala, or R, the underlying code generation is identical, because every execution plan goes through the same Catalyst optimizer.
For example, high-level domain-specific code in Scala and its equivalent query in SQL generate exactly the same underlying code. Below, the same "seniors" query is expressed three ways: against a Dataset of Person objects, against a DataFrame, and against a SQL table "person".
```scala
// a Dataset of Person objects with fields fname, lname, age, weight
// access using object notation
val seniorDS = peopleDS.filter(p => p.age > 55)

// a DataFrame with named columns fname, lname, age, weight
// access using column-name notation
val seniorDF = peopleDF.where(peopleDF("age") > 55)

// equivalent Spark SQL code
val seniorSQLDF = spark.sql("SELECT age FROM person WHERE age > 55")
```
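One detail the SQL variant depends on: spark.sql can only see the data once the DataFrame has been registered as a table or view named "person". A minimal sketch, reusing the hypothetical `peopleDF` from above:

```scala
// expose the DataFrame to Spark SQL under the name "person"
peopleDF.createOrReplaceTempView("person")
```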
If you want to understand why structured data matters in Spark, and why DataFrames, Datasets, and Spark SQL provide an efficient way to program Spark, the video at https://youtu.be/1a4pgYzeFwE explains it well.
## 5. Graph processing with GraphFrames
Although Spark has a general-purpose RDD-based graph processing library, GraphX, which is optimized for distributed computing and supports graph algorithms, it still has some shortcomings: it is built on the low-level RDD API and offers no Java or Python APIs. Because of this, it cannot benefit from the performance optimizations recently introduced through Project Tungsten and the Catalyst optimizer.
In contrast, the DataFrame-based graph processing library GraphFrames addresses all of these problems: it provides a library similar to GraphX but with a higher-level, more readable API; it supports Java, Scala, and Python; it can save and load graphs; and it takes advantage of Spark 2.0's underlying performance and query optimizations. In addition, it interoperates with GraphX, which means you can seamlessly convert a GraphFrames graph into an equivalent GraphX representation and back.
Consider a graph of cities identified by their airport codes. All vertices can be represented as rows of a DataFrame; similarly, all edges can be represented as rows of a DataFrame, each with its own named, typed columns. Together, these vertex and edge DataFrames form a GraphFrame.
```scala
import org.apache.spark.sql.functions.desc
import org.graphframes.GraphFrame

// create a vertices DataFrame
val vertices = spark.createDataFrame(List(
  ("JFK", "New York", "NY")
)).toDF("id", "city", "state")

// create an edges DataFrame
val edges = spark.createDataFrame(List(
  ("JFK", "SEA", 45, 1058923)
)).toDF("src", "dst", "delay", "tripID")

// create a GraphFrame and use its APIs
val airportGF = GraphFrame(vertices, edges)

// filter all edges of the GraphFrame with delays greater than 30 mins
val delayDF = airportGF.edges.filter("delay > 30")

// using the PageRank algorithm, rank airports by importance
val pageRanksGF = airportGF.pageRank.resetProbability(0.15).maxIter(5).run()
display(pageRanksGF.vertices.orderBy(desc("pagerank")))
```
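Because GraphFrames interoperates with GraphX, as noted above, converting `airportGF` is a one-line call. A minimal sketch, under the assumption that the graph's vertex and edge attributes come back as Rows (GraphFrame.fromGraphX provides the reverse conversion):

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.sql.Row

// GraphFrame -> GraphX: vertex and edge rows become GraphX attributes,
// so existing RDD-based GraphX algorithms can run on the same data
val gx: Graph[Row, Row] = airportGF.toGraphX
```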
Using GraphFrames, you can express three powerful kinds of query. The first is simple SQL-style queries over vertices and edges, such as: which routes tend to have significant delays? The second is graph-style queries, such as: how many edges come into and go out of each vertex? The third is motif queries, which find patterns in the graph by providing a structural pattern of vertices and edges to match against.
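A hedged sketch of the second and third kinds of query against the `airportGF` graph above, using GraphFrames' inDegrees, outDegrees, and find APIs (the round-trip pattern is an illustrative example, not from the original post):

```scala
// graph-style query: incoming and outgoing edge counts per vertex
val inDeg  = airportGF.inDegrees   // DataFrame with columns: id, inDegree
val outDeg = airportGF.outDegrees  // DataFrame with columns: id, outDegree

// motif query: match a structural pattern of two connected flights a->b->c,
// then keep only round trips (patterns that end where they started)
val roundTrips = airportGF
  .find("(a)-[e1]->(b); (b)-[e2]->(c)")
  .filter("a.id = c.id")
```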
In addition, GraphFrames easily supports all of GraphX's graph algorithms. For example, use PageRank to find the most important vertices, determine the shortest path from a source to a destination, perform a breadth-first search (BFS), or identify strongly connected components.
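A sketch of two of these algorithms on `airportGF`, using the bfs and shortestPaths entry points of the GraphFrames API (the airport ids are the ones used above):

```scala
// breadth-first search: a shortest (fewest-hops) path from JFK to SEA
val paths = airportGF.bfs.fromExpr("id = 'JFK'").toExpr("id = 'SEA'").run()

// shortest-path distances from every vertex to each landmark vertex
val sp = airportGF.shortestPaths.landmarks(Seq("SEA")).run()
```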
In a webinar (http://go.databricks.com/graphframes-dataframe-based-graphs-for-apache-spark), Spark community contributor Joseph Bradley discussed the motivation for GraphFrames and its ease of use for graph processing, as well as the benefits of its DataFrame-based API. As part of the webinar, you will also see all of the query types and algorithms described above.
Apache Spark 2.0 and many Spark components, including MLlib (machine learning) and Streaming, increasingly favor equivalent DataFrame APIs because of the performance gains, ease of use, and high-level abstraction and structure. Where necessary or appropriate for your use case, you can choose GraphFrames instead of GraphX. The figure below is a concise summary and comparison of GraphX and GraphFrames.
GraphFrames is bound to develop quickly. New versions of GraphFrames will be released as a Spark package compatible with Spark 2.0.
This is the end of the content on "what is the core structure of Apache Spark 2.0". Thank you for reading. If you want to learn more about the industry, you can follow our industry information channel, where new topics are covered every day.