How to understand the basic principles of Spark

This article explains the basic principles of Spark in detail. It is shared here as a reference, and I hope you will have a solid understanding of the relevant concepts after reading it.

First, the advantages of Spark

As the successor to the big data computing framework MapReduce, Spark has the following advantages.

1. High efficiency

Unlike MapReduce, which writes intermediate results to disk, Spark keeps intermediate results in memory, which reduces the disk IO of iterative computation. By optimizing the DAG of the parallel computation, it also reduces the dependencies between tasks and the time spent waiting. For in-memory computation, Spark can be up to 100 times faster than MapReduce.
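As a rough illustration of keeping intermediate results in memory, the minimal sketch below caches an RDD that is reused across iterations, so only the first pass reads from disk; the file path and the iteration count are placeholders, not part of the original article.

from pyspark import SparkContext, SparkConf

# a minimal sketch (assumed file path and iteration count) of reusing an
# in-memory intermediate result across iterations
conf = SparkConf().setAppName("cache-demo").setMaster("local[4]")
sc = SparkContext(conf=conf)

points = sc.textFile("./data/points.txt") \
    .map(lambda line: [float(v) for v in line.split(",")]) \
    .cache()  # keep the parsed records in memory after the first computation

for _ in range(10):
    # every iteration after the first reuses the cached partitions
    # instead of re-reading and re-parsing the file
    total = points.map(lambda p: sum(p)).reduce(lambda x, y: x + y)

print(total)
sc.stop()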

2. Ease of use

Unlike MapReduce, which only supports Map and Reduce operations, Spark provides more than 80 different Transformation and Action operators, such as map, reduce, filter, groupByKey, sortByKey, and foreach, and adopts a functional programming style, which greatly reduces the amount of code needed to achieve the same functionality.
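For instance, a few of these operators can be chained together in a functional style; the sketch below uses made-up key-value data purely for illustration.

from pyspark import SparkContext

sc = SparkContext("local[2]", "operators-demo")

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)])

result = (pairs
    .filter(lambda kv: kv[1] % 2 == 1)   # Transformation: keep odd values
    .mapValues(lambda v: v * 10)         # Transformation: rescale the values
    .reduceByKey(lambda x, y: x + y)     # Transformation: combine values per key
    .sortByKey())                        # Transformation: order by key

print(result.collect())                  # Action: [('a', 40)]

sc.stop()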

3. Versatility

Spark provides a unified solution. Spark can be used for batch processing, interactive query (Spark SQL), real-time streaming processing (Spark Streaming), machine learning (Spark MLlib), and graph computing (GraphX).

These different types of processing can be used seamlessly in the same application. For enterprise applications, a single platform can serve different engineering needs, which reduces the cost of development and platform deployment.
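As a minimal sketch of this seamlessness (the data, column names, and query are invented for illustration), the same application can mix the RDD API for batch processing with Spark SQL for interactive queries:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unified-demo").master("local[2]").getOrCreate()
sc = spark.sparkContext

# batch-style processing with the RDD API
logs = sc.parallelize([("2024-01-01", "ERROR"), ("2024-01-01", "INFO"), ("2024-01-02", "ERROR")])

# hand the same data to Spark SQL and query it interactively
df = spark.createDataFrame(logs, ["day", "level"])
df.createOrReplaceTempView("logs")
spark.sql("SELECT day, COUNT(*) AS errors FROM logs WHERE level = 'ERROR' GROUP BY day").show()

spark.stop()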

4. Compatibility

Spark is compatible with many open-source projects. For example, it can use Hadoop's YARN or Apache Mesos as its resource manager and scheduler, and it can read a variety of data sources, such as HDFS, HBase, and MySQL.
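A sketch of reading from two of these sources; the HDFS path, JDBC URL, table name, and credentials are placeholders, and the MySQL read assumes the JDBC driver is available on the classpath (HBase usually requires a separate connector).

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sources-demo").getOrCreate()

# read a text file from HDFS (placeholder path)
lines = spark.sparkContext.textFile("hdfs://namenode:8020/path/to/data.txt")
print(lines.count())

# read a MySQL table over JDBC (placeholder URL, table, and credentials)
users = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")
    .option("dbtable", "users")
    .option("user", "reader")
    .option("password", "secret")
    .load())
users.show()

spark.stop()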

Second, the basic concepts of Spark

RDD: short for Resilient Distributed Dataset, an abstraction of distributed memory that provides a highly restricted shared-memory model.

DAG: short for Directed Acyclic Graph; it reflects the dependency relationships between RDDs.

Driver Program: the control program, responsible for building the DAG for the Application.

Cluster Manager: the cluster resource management center, responsible for allocating computing resources.

Worker Node: a worker node, responsible for carrying out the actual computation.

Executor: a process running on a Worker Node that runs Tasks and stores data for the application.

Application: a user-written Spark program; an Application contains multiple Jobs.

Job: a job contains multiple RDDs and the various operations acting on them.

Stage: a stage is the basic scheduling unit of a job; a job is divided into multiple groups of tasks, and each group is called a stage.

Task: a unit of work that runs on an Executor; a task is a thread in an Executor.

Summary: an Application consists of multiple Jobs, a Job consists of multiple Stages, and a Stage consists of multiple Tasks. The Stage is the basic unit of job scheduling, as the sketch below illustrates.
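A minimal sketch of this hierarchy (the data is made up): the single Action below submits one Job, the shuffle introduced by reduceByKey splits that Job into two Stages, and each Stage runs one Task per partition on the Executors.

from pyspark import SparkContext

sc = SparkContext("local[2]", "hierarchy-demo")

# 2 partitions, so each Stage runs 2 parallel Tasks
words = sc.parallelize(["a", "b", "a", "c"], 2)
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

# collect() is the Action: it submits one Job for this lineage,
# which the scheduler cuts into two Stages at the reduceByKey shuffle
print(counts.collect())

sc.stop()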

Third, Spark architecture design

A Spark cluster consists of the Driver, the Cluster Manager (Standalone, Yarn, or Mesos), and Worker Nodes. For each Spark application, there is an Executor process on each Worker Node, and each Executor process contains multiple task threads.

For pyspark, in order not to break Spark's existing runtime architecture, Spark wraps a layer of Python API around it. On the driver side, the interaction between Python and Java is implemented with the help of Py4j, so the Spark application can be written in Python. On the executor side, Py4j is not needed, because the Task logic running there is sent by the Driver as serialized bytecode.

Fourth, Spark running process

1. The Driver first constructs the DAG for the Application and decomposes it into Stages.

2. The Driver then requests resources from the Cluster Manager.

3. The Cluster Manager sends recruitment signals to some Worker Nodes.

4. The recruited Worker Nodes start Executor processes in response and apply to the Driver for tasks.

5. The Driver assigns Tasks to the Worker Nodes.

6. The Executors execute the Tasks in units of Stages, with the Driver monitoring progress.

7. After receiving the signal that the Executor tasks are completed, the Driver sends a deregistration signal to the Cluster Manager.

8. The Cluster Manager sends a release-resources signal to the Worker Nodes.

9. The Executors on the Worker Nodes stop running.

Fifth, Spark deployment modes

Local: local operation mode, non-distributed.

Standalone: Spark's own cluster manager; after deployment, only Spark tasks can be run.

Yarn: Hadoop's cluster manager; after deployment, various tasks such as MapReduce, Spark, Storm, and HBase can be run.

Mesos: the biggest difference from Yarn is that Mesos uses two-level resource allocation: Mesos makes a first-level offer of resources, and the computing framework can choose to accept or reject it.
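In practice the deployment mode is usually selected through the master URL, either in code or via spark-submit --master; a minimal sketch follows, with host names and ports as placeholders.

from pyspark import SparkConf

conf_local      = SparkConf().setMaster("local[4]")                  # Local: 4 threads, non-distributed
conf_standalone = SparkConf().setMaster("spark://master-host:7077")  # Standalone: Spark's own cluster manager
conf_yarn       = SparkConf().setMaster("yarn")                      # Yarn: resources managed by Hadoop YARN
conf_mesos      = SparkConf().setMaster("mesos://mesos-host:5050")   # Mesos: two-level resource offers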

Sixth, RDD data structure

RDD stands for Resilient Distributed Dataset. It is a read-only, partitioned collection of records and the basic data structure of Spark.

RDD represents an immutable, partitioned collection of elements that can be computed in parallel.

There are generally two ways to create an RDD: the first is to read data from a file, and the second is to parallelize an object in memory.

# generate an RDD by reading a file
rdd = sc.textFile("hdfs://hans/data_warehouse/test/data")

# get an RDD by parallelizing an object in memory
arr = [1, 2, 3, 4, 5]
rdd = sc.parallelize(arr)

After creating an RDD, you can operate on it using a variety of operators.

There are two types of RDD operations: Transformations and Actions. A Transformation creates a new RDD from an existing RDD, while an Action performs a computation on the RDD and returns the result to the Driver.

All Transformations are lazy: Spark does not carry out the actual calculation immediately, but only records the execution lineage. Only when an Action is triggered is the computation actually executed according to the DAG.
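A minimal sketch of this laziness (made-up numbers): the two Transformations below only record the lineage, and nothing is computed until the Action at the end.

from pyspark import SparkContext

sc = SparkContext("local[2]", "lazy-demo")

nums = sc.parallelize(range(1, 6))

# Transformations: only the lineage is recorded, no computation happens yet
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: triggers the actual execution of the recorded DAG
print(evens.collect())   # [4, 16]

sc.stop()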

The operations determine the dependencies between RDDs.

There are two types of dependencies between RDDs: narrow dependencies and wide dependencies. With a narrow dependency, the relationship between the partitions of the parent RDD and the partitions of the child RDD is one-to-one or many-to-one. With a wide dependency, the relationship between the partitions of the parent RDD and the partitions of the child RDD is one-to-many or many-to-many.

Operations that introduce wide dependencies generally involve a shuffle, which distributes records with different keys from each parent RDD partition to different child RDD partitions through a Partitioner function.

The dependencies determine how the DAG is split into Stages.

Cutting rule: traverse the DAG from back to front, and cut a new Stage whenever a wide dependency is encountered.

The dependency relationships between RDDs form a directed acyclic graph (DAG), which is submitted to the DAGScheduler. The DAGScheduler divides the DAG into multiple stages that depend on each other; the division is based on the wide and narrow dependencies between RDDs, cutting a stage whenever a wide dependency is encountered. Each stage contains one or more tasks, which are then submitted as a TaskSet to the TaskScheduler to run.
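As a rough way to see this, toDebugString prints an RDD's lineage; in the sketch below (made-up data) the ShuffledRDD produced by the wide dependency (reduceByKey) marks where the DAGScheduler cuts a new stage.

from pyspark import SparkContext

sc = SparkContext("local[2]", "lineage-demo")

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], 2)

narrow = pairs.mapValues(lambda v: v + 1)       # narrow dependency: no shuffle
wide = narrow.reduceByKey(lambda x, y: x + y)   # wide dependency: shuffle, new stage

# the lineage shows a ShuffledRDD where the stage boundary is cut;
# depending on the PySpark version this may come back as bytes
lineage = wide.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

sc.stop()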

Seventh, WordCount example

import findspark

# specify spark_home as the unpacked Spark directory and the python path
spark_home = "/Users/liangyun/ProgramFiles/spark-3.0.1-bin-hadoop3.2"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home, python_path)

import pyspark
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)

It only takes 5 lines of code to complete WordCount word frequency statistics.

rdd_line = sc.textFile("./data/hello.txt")
rdd_word = rdd_line.flatMap(lambda x: x.split(" "))
rdd_one = rdd_word.map(lambda t: (t, 1))
rdd_count = rdd_one.reduceByKey(lambda x, y: x + y)
rdd_count.collect()

[('world', 1),
 ('love', 3),
 ('jupyter', 1),
 ('pandas', 1),
 ('hello', 2),
 ('spark', 4),
 ('sql', 1)]

This is the end of the sharing on how to understand the basic principles of Spark. I hope the content above is helpful to you. If you think the article is good, you can share it for more people to see.
