
Spark Series (1)-- A brief introduction to Spark


I. Brief Introduction

Spark was born in 2009 in the AMPLab at the University of California, Berkeley. It was donated to the Apache Software Foundation in 2013 and became an Apache top-level project in February 2014. Compared with MapReduce batch computing, Spark can deliver performance improvements of up to a hundred times, which has made it the most widely used distributed computing framework after MapReduce.

II. Characteristics

Apache Spark has the following characteristics:

- Performance guaranteed by an advanced DAG scheduler, a query optimizer, and a physical execution engine;
- Multi-language support: Java, Scala, Python, and R are currently supported;
- More than 80 high-level APIs for easily building applications (a minimal example follows this list);
- Support for batch processing, stream processing, and complex business analytics;
- Rich library support: includes libraries such as SQL, MLlib, GraphX, and Spark Streaming, which can be combined seamlessly;
- Rich deployment modes: supports local mode and the built-in cluster mode, and can also run on Hadoop, Mesos, and Kubernetes;
- Multiple data source support: can access data from HDFS, Alluxio, Cassandra, HBase, Hive, and hundreds of other data sources.
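As a taste of the high-level API, here is a minimal word count sketch written against the Dataset API. It assumes a local text file named input.txt (a hypothetical input) and local execution:

```scala
import org.apache.spark.sql.SparkSession

// Minimal word count sketch. "input.txt" is an assumed local file;
// local[*] runs Spark in-process using all available cores.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("WordCount")
      .getOrCreate()
    import spark.implicits._ // encoders for the Dataset operations below

    spark.read.textFile("input.txt")   // Dataset[String], one element per line
      .flatMap(_.split("\\s+"))        // split lines into words
      .groupByKey(identity)            // group identical words together
      .count()                         // (word, count) pairs
      .show()

    spark.stop()
  }
}
```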

III. Cluster Architecture

- Application: a Spark application, consisting of one Driver and multiple Executors on the cluster.
- Driver program: the main process of the application; it runs the application's main() method and creates the SparkContext.
- Cluster manager: the cluster resource manager (for example, Standalone Manager, Mesos, or YARN).
- Worker node: a node in the cluster that can run computation.
- Executor: a process located on a worker node, responsible for running tasks and keeping output data in memory or on disk.
- Task: the unit of work sent to an Executor.

Execution process:

After the user program creates a SparkContext, it connects to the cluster resource manager, which allocates computing resources to the program and starts the Executors. The Driver divides the computation into execution stages and multiple tasks, and then sends the tasks to the Executors. Each Executor runs its tasks and reports their status to the Driver, and it also reports the resource usage of its node to the cluster resource manager. The sketch below shows where each of these pieces appears in a minimal program.
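The following is a minimal driver sketch, not a definitive template: creating the SparkContext is what connects to the cluster manager, and the reduce action at the end is what gets split into stages and tasks on the executors. local[4] is assumed here for illustration; on a real cluster the master is set when the application is submitted.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch. Creating the SparkContext connects the driver
// to the cluster manager; the reduce action below is divided into tasks
// that run on the executors.
object MinimalDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MinimalDriver")
      .setMaster("local[4]") // assumed local mode; set by spark-submit on a real cluster
    val sc = new SparkContext(conf)

    // 8 partitions means each stage runs as 8 parallel tasks.
    val data = sc.parallelize(1L to 1000000L, numSlices = 8)
    val sum = data.map(_ * 2).reduce(_ + _) // reduce is the action that triggers execution
    println(s"sum = $sum")

    sc.stop()
  }
}
```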

IV. Core Components

On top of Spark Core, Spark provides four core components that address the computing needs of different domains.

4.1 Spark SQL

Spark SQL is mainly used to process structured data. It has the following characteristics:

- Seamlessly mixes SQL queries with Spark programs, allowing you to query structured data using SQL or the DataFrame API;
- Supports a variety of data sources, including Hive, Avro, Parquet, ORC, JSON, and JDBC;
- Supports HiveQL syntax and user-defined functions (UDFs), allowing you to access existing Hive warehouses;
- Supports standard JDBC and ODBC connections;
- Improves query efficiency through its optimizer, columnar storage, and code generation.

A short example combining both query styles follows.
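A minimal sketch, assuming a local JSON file named people.json (hypothetical) with name and age fields, showing the same query through the DataFrame API and through SQL:

```scala
import org.apache.spark.sql.SparkSession

// Spark SQL sketch: the same query expressed with the DataFrame API
// and with SQL. "people.json" is an assumed input file with `name`
// and `age` fields.
object SparkSqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SparkSqlExample")
      .getOrCreate()

    val df = spark.read.json("people.json")

    // DataFrame API
    df.filter(df("age") > 21).select("name").show()

    // Equivalent SQL over the same data
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 21").show()

    spark.stop()
  }
}
```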

4.2 Spark Streaming

Spark Streaming is mainly used to quickly build scalable, high-throughput, fault-tolerant stream-processing applications. It supports reading and processing data from sources such as HDFS, Flume, Kafka, Twitter, and ZeroMQ.

The essence of Spark Streaming is micro-batch processing: it splits the data stream into a series of small batches, which approximates the effect of true stream processing. The sketch below makes the batching explicit.
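A minimal sketch, assuming text arrives on a local TCP socket at port 9999 (a hypothetical source); each 5-second interval becomes one micro-batch:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Streaming word count sketch. The 5-second batch interval is the
// micro-batch: each interval's data is processed as one small batch.
object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(5))

    val lines = ssc.socketTextStream("localhost", 9999) // assumed host and port
    lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .print() // print each micro-batch's counts

    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // run until stopped
  }
}
```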

4.3 MLlib

MLlib is Spark's machine learning library. Its design goal is to make practical machine learning simple and scalable. It provides the following tools:

- ML algorithms: common algorithms such as classification, regression, clustering, and collaborative filtering;
- Featurization: feature extraction, transformation, dimensionality reduction, and selection;
- Pipelines: tools for constructing, evaluating, and tuning ML pipelines (see the sketch after this list);
- Persistence: saving and loading algorithms, models, and pipelines;
- Utilities: linear algebra, statistics, data handling, and so on.
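A minimal Pipeline sketch with an invented four-row training set: tokenize text, hash the terms into feature vectors, then fit a logistic regression model:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// ML Pipeline sketch: Tokenizer -> HashingTF -> LogisticRegression.
// The training data is a tiny invented example.
object PipelineExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("PipelineExample")
      .getOrCreate()

    val training = spark.createDataFrame(Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    )).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // The Pipeline chains the three stages and fits them as one model.
    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
    model.transform(training).select("id", "prediction").show()

    spark.stop()
  }
}
```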

4.4 GraphX

GraphX is a Spark component for graph computing and graph-parallel computation. At a high level, GraphX extends the RDD by introducing a new graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX provides a set of basic operators (such as subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks. A small example follows.
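A minimal sketch with an invented three-vertex graph, using aggregateMessages (one of the basic operators mentioned above) to compute each vertex's in-degree:

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.{SparkConf, SparkContext}

// GraphX sketch: build a small property graph and compute in-degrees
// with aggregateMessages. The vertices and edges are invented.
object GraphExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("GraphExample"))

    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 2L, "follows")))
    val graph = Graph(vertices, edges)

    // Send 1 along each edge to its destination, then sum per vertex.
    val inDegrees = graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
    inDegrees.collect().foreach { case (id, deg) =>
      println(s"vertex $id has in-degree $deg")
    }

    sc.stop()
  }
}
```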

For more articles in the big data series, see the GitHub open source project: Big Data Getting Started Guide.
