Abstract: With the development of big data technology, fields such as real-time stream computing, machine learning, and graph computing have become hot research directions, and Spark, with its relatively mature ecosystem, has become a powerful one-stop tool for these big data scenarios. Do you know which components make up the Spark ecosystem? Let's walk through these indispensable components in this article, which is excerpted from "Illustrating Spark: Core Technology and Case Practice".
With Spark Core at its center, the Spark ecosystem can read data from traditional files (such as text files), HDFS, Amazon S3, Alluxio, and NoSQL stores, and can schedule and manage applications with Standalone, YARN, or Mesos resource managers. These applications come from different Spark components: interactive and batch processing through Spark Shell or Spark Submit, real-time streaming applications with Spark Streaming, ad hoc queries with Spark SQL, approximate queries that trade accuracy for speed with the sampling query engine BlinkDB, machine learning with MLbase/MLlib, graph processing with GraphX, and mathematical computation with SparkR, as shown in the following figure. It is this ecosystem that realizes the goal of "One Stack to Rule Them All".
Spark Core
Spark Core is the core component of the whole BDAS (Berkeley Data Analytics Stack) and a distributed framework for processing big data. Spark Core supports several resource schedulers and relies on mechanisms such as in-memory computing and directed acyclic graph (DAG) execution to make distributed computation fast, while the RDD abstraction gives data high fault tolerance. Its important features are described below.
Spark Core provides a variety of running modes: it can run tasks with its own built-in modes, such as local mode and Standalone, or with third-party resource scheduling frameworks such as YARN and Mesos. In comparison, the third-party frameworks can manage resources at a finer granularity.
Spark Core provides a distributed parallel computing framework based on a directed acyclic graph (DAG), together with an in-memory mechanism that supports multiple iterations over, or sharing of, the same data. This greatly reduces the cost of re-reading data between iterations and significantly improves the performance of data mining and analysis workloads that require many iterations. In addition, during task processing Spark moves computation rather than data: each RDD partition reads its data block from the distributed file system into the memory of the node where it is computed.
Spark introduces the RDD (Resilient Distributed Dataset) abstraction: a read-only collection of objects partitioned across a group of nodes. These collections are resilient; if part of a dataset is lost, it can be reconstructed from its "lineage", which guarantees high fault tolerance.
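To make the RDD idea concrete, here is a minimal sketch (not from the book) of how a chain of transformations builds a lineage that Spark can use for recovery; it assumes a local master and a hypothetical HDFS input path.

import org.apache.spark.{SparkConf, SparkContext}

object RddLineageSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-lineage-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Each transformation returns a new read-only RDD; nothing runs yet (lazy evaluation).
    val lines  = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical input path
    val words  = lines.flatMap(_.split("\\s+"))
    val pairs  = words.map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)

    // toDebugString prints the lineage (the chain of parent RDDs) used to rebuild lost partitions.
    println(counts.toDebugString)

    // An action triggers the actual distributed computation.
    counts.take(10).foreach(println)

    sc.stop()
  }
}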
Spark Streaming
Spark Streaming is a high-throughput, highly fault-tolerant streaming system for real-time data streams. It can apply complex operations such as map, reduce, and join to a variety of data sources (such as Kafka, Flume, Twitter, and ZeroMQ) and save the results to external file systems, databases, or live dashboards, as shown in the following figure.
Other processing engines either focus only on stream processing or handle only batch processing (offering, at most, stream-processing API interfaces that must be implemented externally). The biggest advantage of Spark Streaming is that a single processing engine and a single RDD programming model provide both batch processing and stream processing.
Unlike traditional stream processing, which handles one record at a time, Spark Streaming discretizes the stream (Discretized Streams), which allows data to be processed in batches at sub-second granularity. During Spark Streaming processing, receivers ingest data in parallel and cache it in the memory of the Spark worker nodes. After latency optimization, the Spark engine can run these short batch tasks (tens of milliseconds) and output the results to other systems. Unlike the traditional continuous-operator model, in which operators are statically assigned to nodes, Spark can assign tasks to worker nodes dynamically based on where the data is and which resources are available.
Thanks to discretized streams (DStreams), Spark Streaming has the following characteristics.
Dynamic load balancing: Spark Streaming divides the data into small batches, which allows resources to be allocated at a finer granularity. For example, in a traditional record-at-a-time streaming system where the input stream is partitioned by key, if the computing load on one node exceeds its capacity, that node becomes a bottleneck and slows down the whole system. In Spark Streaming, jobs are dynamically balanced across the nodes, as shown in the figure: a node whose tasks take longer is assigned fewer tasks, and a node whose tasks finish quickly is assigned more.
Fast failure recovery: when a node fails, a traditional streaming system restarts the failed continuous operator on another node and may rerun part of the earlier data flow to recover lost data. During that process only the replacement node reprocesses the failed work, and the system cannot handle other tasks until that node has redone all of the computation that preceded the failure. In Spark, computation is divided into many small tasks whose results can be merged correctly no matter which node runs them, so when a node fails, its tasks are spread evenly across the remaining nodes in the cluster, which recovers much faster than the traditional migrate-and-replay mechanism.
Unification of batch processing, stream processing, and interactive analysis: Spark Streaming decomposes stream computation into a series of short batch jobs. The input data is divided into discretized streams (DStreams) by batch size (for example, one second); each slice of data becomes an RDD in Spark, and DStream operations in Spark Streaming are translated into batch operations on those RDDs. In addition, the stream data is kept in the memory of the Spark nodes, so users can also query it interactively on demand. This working mechanism is what lets Spark combine batch, streaming, and interactive processing.
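As an illustration of the micro-batch model described above, here is a minimal Spark Streaming sketch (not from the book); the host, port, and one-second batch interval are assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))   // batch size = 1 second

    // Receiver-based source; each one-second slice of the input becomes one RDD in the DStream.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()   // output operation, executed once per batch

    ssc.start()
    ssc.awaitTermination()
  }
}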
Spark SQL
Spark SQL's predecessor is Shark. When Shark appeared, Hive was essentially the only option for SQL on Hadoop (Hive compiles SQL into scalable MapReduce jobs), and given Hive's performance and the desire for compatibility with Spark, Shark was born.
Shark, that is, Hive on Spark, essentially reuses Hive's HQL parsing: HQL is translated into the corresponding RDD operations on Spark, table information is obtained from Hive's metadata, and the underlying data, which is really just files on HDFS, is read by Shark and computed on Spark. Shark's most important features are its speed and its full compatibility with Hive. In addition, APIs such as rdd2sql can be used in shell mode to keep working with the result set of an HQL query in a Scala environment, so users can write simple machine learning or analysis functions to further process the HQL results.
At the Spark Summit on July 1, 2014, Databricks announced that development of Shark would end so that efforts could focus on Spark SQL. Databricks explained that Shark was essentially a modification of Hive that replaced Hive's physical execution engine to make it faster. However, Shark inherited a large amount of Hive code, which made optimization and maintenance troublesome. As performance optimization and integration with advanced analytics went deeper, the parts designed around MapReduce clearly became the bottleneck of the whole project. Therefore, to move development forward and give users a better experience, Databricks terminated the Shark project and put its energy into Spark SQL.
Spark SQL allows developers to work directly with RDDs while also querying external data stored, for example, in Hive. An important feature of Spark SQL is that it handles relational tables and RDDs in a unified way, so developers can mix SQL commands for external queries with more complex data analysis.
Spark SQL has the following characteristics.
It introduces a new RDD type, SchemaRDD, which can be defined much like a table in a traditional database. A SchemaRDD consists of row objects together with a schema describing the data type of each column. A SchemaRDD can be converted from an existing RDD, read from a Parquet file, or obtained from Hive with HiveQL.
It embeds the Catalyst query optimization framework: after SQL is parsed into a logical execution plan, classes and interfaces in the Catalyst package apply a number of simple optimizations to the plan before it is finally turned into RDD computations.
Data from different sources can be mixed in one application; for example, data obtained via HiveQL can be joined with data obtained via SQL. When Shark appeared, it made SQL-on-Hadoop 10 to 100 times faster than Hive. So how does Spark SQL perform once freed from the constraints of Hive? Although the improvement is not as dramatic as Shark's over Hive, it still performs very well, as shown in the figure (where the data on the right is Spark SQL).
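To illustrate the SchemaRDD features listed above, here is a minimal sketch (not from the book) written against the early Spark SQL API that the text describes (roughly Spark 1.1-1.2; method names changed in later versions). The file names, the Person case class, and the "user" column of the logs table are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion (1.x API)

    // A SchemaRDD built from an ordinary RDD of case-class rows (hypothetical "name,age" file).
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerTempTable("people")

    // A SchemaRDD read directly from a Parquet file; both sources can be joined with SQL.
    val logs = sqlContext.parquetFile("logs.parquet")
    logs.registerTempTable("logs")

    val result = sqlContext.sql(
      "SELECT p.name, COUNT(*) FROM people p JOIN logs l ON p.name = l.user " +
      "WHERE p.age >= 18 GROUP BY p.name")
    result.collect().foreach(println)

    sc.stop()
  }
}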
Why is Spark SQL's performance so much better? The main reason is that Spark SQL is optimized in the following ways.
In-memory columnar storage (In-Memory Columnar Storage): Spark SQL stores table data in memory in a columnar format instead of as raw JVM objects.
Bytecode generation (Bytecode Generation): Spark 1.1.0 added a Codegen module to the expressions part of the Catalyst module. It uses dynamic bytecode generation to compile matched expressions into specific code at run time. In addition, code-generation (CG) optimization is applied to SQL expressions; this CG optimization relies mainly on the runtime reflection mechanism of Scala 2.10.
Scala code optimization: when writing Scala code, Spark SQL avoids inefficient code and code that is prone to triggering garbage collection. Although this makes the code harder to write, the interface remains uniform for users.
BlinkDB
BlinkDB is a massively parallel query engine for running interactive SQL queries over massive data sets. It lets users shorten query response time by trading off data accuracy, keeping the error within an allowed range. To achieve this goal, BlinkDB relies on the following core ideas:
An adaptive optimization framework that builds and maintains a set of multi-dimensional samples of the original data over time.
A dynamic sample-selection strategy that picks a sample of the appropriate size based on the accuracy the query requires and how quickly a response is needed. Unlike a traditional relational database, BlinkDB is an interactive query system that works like a seesaw: users must trade query accuracy against query time. If they want results faster, they sacrifice accuracy; conversely, if they want higher accuracy, they must accept a longer response time. For example, a BlinkDB query can ask for an answer within a time bound (say, a few seconds) or within an error bound at a given confidence level, and the engine chooses the sample size accordingly. The following figure shows the BlinkDB architecture.
MLBase/MLlib
MLBase is the component of the Spark ecosystem that focuses on machine learning. Its goal is to lower the barrier to machine learning so that users with little machine learning background can use MLBase easily. MLBase consists of four parts: MLRuntime, MLlib, MLI, and ML Optimizer.
MLRuntime: the distributed in-memory computing layer provided by Spark Core; it runs the algorithms chosen and tuned by ML Optimizer on the data and outputs the analysis results.
MLlib: Spark's implementations of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction, and low-level optimization primitives; the set of algorithms can be extended.
MLI: an API and platform for feature extraction and for developing algorithms with high-level ML programming abstractions.
ML Optimizer: selects the machine learning algorithms and parameters from the internal implementations that it considers best suited to the user's input data, and returns a model or other results that help with the analysis.
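As a small illustration of the MLlib layer described above, here is a minimal sketch (not from the book) that trains a k-means clustering model with the RDD-based MLlib API; the input file of space-separated numeric features is hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object MllibKMeansSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))

    // Parse each line "x1 x2 x3 ..." into an MLlib dense vector.
    val data = sc.textFile("features.txt")   // hypothetical input file
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
      .cache()

    // Train k-means with 3 clusters and at most 20 iterations.
    val model = KMeans.train(data, 3, 20)

    model.clusterCenters.foreach(println)
    println(s"Within-set sum of squared errors: ${model.computeCost(data)}")

    sc.stop()
  }
}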
The core of MLBase is its ML Optimizer, which turns declarative tasks into complex learning plans and finally produces an optimal model and the computed results. MLBase differs from other machine learning systems such as Weka and Mahout; each has its own characteristics, as follows.
MLBase is built on Spark and uses distributed in-memory computing; Weka is a single-machine system, and Mahout uses MapReduce for data processing (although Mahout is moving toward Spark).
MLBase handles algorithm selection automatically; Weka and Mahout both require users to have machine learning skills and to choose the algorithms and parameters themselves.
MLBase provides interfaces at different levels of abstraction, which users can use to extend the set of algorithms.
GraphX
GraphX started as a distributed graph computing framework project at Berkeley's AMP Lab and was later integrated into Spark as a core component. It is Spark's API for graphs and graph-parallel computation and can be seen as a rewrite and optimization of GraphLab and Pregel on top of Spark. Compared with other distributed graph computing frameworks, GraphX's biggest advantage is that it provides a one-stack data solution on Spark, so a complete graph-computing pipeline can be carried out efficiently in one place.
The core abstraction of GraphX is the Resilient Distributed Property Graph, a directed multigraph with attributes on both vertices and edges. GraphX extends Spark's RDD abstraction with two views, Table and Graph, that share a single physical representation; each view has its own operators, which makes operations flexible and execution efficient. Most of the implementation work in GraphX's overall architecture revolves around optimizing partitioning, which shows to some extent that vertex-cut storage and the corresponding computational optimizations are indeed the key difficulties of a graph computing framework.
The underlying design of GraphX has the following key points.
(1) All operations on the Graph view are ultimately translated into RDD operations on its associated Table view. In this way, a graph computation is logically equivalent to a series of RDD transformations, so Graph inherits the three key properties of RDDs: it is immutable, distributed, and fault-tolerant. The most important of these is immutability: logically, every transformation or operation on a graph produces a new graph; physically, GraphX reuses unchanged vertices and edges to a certain extent, transparently to the user.
(2) The physical data shared by the two views consists of two RDDs: RDD[VertexPartition] and RDD[EdgePartition]. Vertices and edges are not stored as flat collections of tuples; instead, VertexPartition and EdgePartition store their slice of the data internally in indexed structures that speed up traversal in either view. This immutable index structure is shared across RDD transformations, which reduces computing and storage overhead.
(3) The graph is distributed using vertex-cut partitioning via the partitionBy method, with the partition strategy (PartitionStrategy) chosen by the user. The partition strategy assigns edges to EdgePartitions and assigns the master copy of each vertex to a VertexPartition; each EdgePartition also caches ghost copies of its local boundary vertices. Different partition strategies affect how many ghost copies must be cached and how evenly edges are spread across EdgePartitions, so the best strategy should be chosen according to the structural characteristics of the graph.
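To make the Table and Graph views and the vertex-cut partitioning concrete, here is a minimal GraphX sketch (not from the book); the vertices, edges, and the choice of the EdgePartition2D strategy are illustrative assumptions.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}

object GraphxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graphx-sketch").setMaster("local[*]"))

    // Table view: vertices and edges are plain RDDs carrying user-defined attributes.
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "likes")))

    // Graph view over the same data, with edges assigned to partitions by a vertex-cut strategy.
    val graph = Graph(vertices, edges).partitionBy(PartitionStrategy.EdgePartition2D)

    // A graph operation that is ultimately executed as RDD transformations:
    // compute the in-degree of each vertex.
    graph.inDegrees.collect().foreach { case (id, deg) =>
      println(s"vertex $id has in-degree $deg")
    }

    sc.stop()
  }
}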
SparkR
R is free, open-source software released under the GNU license and is widely used for statistical computing and statistical graphics, but it runs only on a single machine. To make it possible to analyze large-scale distributed data with R, AMP Lab at Berkeley developed SparkR, and the component was added in Spark 1.4. With SparkR, users can analyze large data sets and run jobs interactively from the R shell. SparkR's features are as follows:
It exposes Spark's resilient distributed dataset (RDD) API, so users can run Spark jobs interactively on a cluster from the R shell.
It supports closure serialization, automatically serializing the variables referenced in user-defined functions and shipping them to other machines in the cluster.
SparkR can also easily call R packages: simply load the package with includePackage before running operations on the cluster.
Below is a schematic diagram of the processing flow of SparkR.
Alluxio
Alluxio is a distributed, memory-based, highly fault-tolerant file system that lets files be shared reliably across cluster frameworks such as Spark and MapReduce at memory speed. Alluxio is middleware that sits between the underlying distributed file storage and the computing frameworks above it. Its main job is to keep files that do not need to be persisted to the underlying DFS in a distributed in-memory file system instead, so that memory can be shared and efficiency improved, while also reducing memory redundancy, GC time, and so on.
Like Hadoop, Alluxio uses a traditional Master-Slave architecture: all Alluxio Workers are managed by an Alluxio Master, which uses the heartbeats sent by each Worker to determine whether the Worker has crashed and how much memory it has left. To avoid a single point of failure, ZooKeeper is used for high availability (HA).
Alluxio has the following features.
Java-like File API: Alluxio provides an API similar to the Java File class.
Compatibility: Alluxio implements the HDFS interface, so Spark and MapReduce programs can run on it without any modification.
Pluggable underlying file systems: Alluxio provides fault tolerance by checkpointing in-memory data to the underlying file system, and it has a generic interface that makes it easy to plug in different underlying file systems. HDFS, S3, GlusterFS, and single-node local file systems are currently supported, and more file systems will be supported in the future. The applications supported by Alluxio are shown in the following figure.
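As an illustration of the compatibility point above, here is a minimal sketch (not from the book) of a Spark job reading and writing through Alluxio simply by using an alluxio:// URI; the master host, port, and paths are assumptions, and the Alluxio client library is assumed to be on the Spark classpath.

import org.apache.spark.{SparkConf, SparkContext}

object AlluxioSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alluxio-sketch"))

    // Read a file that Alluxio serves from memory (falling back to the under-store if needed).
    val logs = sc.textFile("alluxio://alluxio-master:19998/data/access.log")

    val errors = logs.filter(_.contains("ERROR"))

    // Write results back through Alluxio; other frameworks can then share them at memory speed.
    errors.saveAsTextFile("alluxio://alluxio-master:19998/data/errors")

    sc.stop()
  }
}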
This article is excerpted from "Illustrating Spark: Core Technology and Case Practice".