Summary of Spark usage: an introduction to big data

2025-01-19 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 06/03 Report

1. Number of partitions

The input to a Spark job may be stored on HDFS as multiple files, and each file is stored as a number of blocks (Block).

When Spark reads these files as input, it parses them with the InputFormat that matches the data format. Generally, several blocks are merged into one input split, called an InputSplit. Note that an InputSplit cannot span files.

A specific Task is then generated for each input split; InputSplit and Task have a one-to-one correspondence.
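The split-planning rule above (merge consecutive blocks, but never let a split cross a file boundary) can be sketched in plain Python. This is a simplified model, not Spark's actual InputFormat code, and the block and split sizes are made up for illustration:

```python
def plan_splits(files, max_split_bytes):
    """Greedily merge consecutive blocks of each file into input splits.

    files: list of (filename, [block_size, ...]).
    A split never spans files, mirroring the InputSplit rule.
    Returns a list of (filename, split_size) pairs, one per InputSplit.
    """
    splits = []
    for name, blocks in files:
        current = 0
        for b in blocks:
            if current and current + b > max_split_bytes:
                splits.append((name, current))  # close the split at the size limit
                current = 0
            current += b
        if current:
            splits.append((name, current))      # last split of this file
    return splits

# Two files made of 128 MB blocks; splits are capped at 256 MB and per-file.
files = [("a.txt", [128, 128, 128]), ("b.txt", [128])]
print(plan_splits(files, 256))
# -> [('a.txt', 256), ('a.txt', 128), ('b.txt', 128)]
```

Each planned split would then correspond to one Task and, downstream, one partition of the resulting RDD.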

Each of these Tasks is then assigned to an Executor on a node in the cluster for execution.

Each node can have one or more Executors.

Each Executor has several cores, and each core of an Executor can execute only one Task at a time.

The result of each Task execution is one partition of the target RDD.

Note: a core here is a virtual core rather than a physical CPU core of the machine; it can be understood as a worker thread of the Executor.

The degree of concurrency of Task execution = number of Executors × number of cores per Executor.
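A quick worked example of this formula (the executor count, core count, and task count are hypothetical):

```python
num_executors = 4
cores_per_executor = 3
num_tasks = 30  # e.g. a stage with 30 partitions

# Tasks that can run at the same time across the cluster.
parallelism = num_executors * cores_per_executor

# The stage therefore runs in "waves" of tasks (ceiling division).
waves = -(-num_tasks // parallelism)

print(parallelism)  # -> 12
print(waves)        # -> 3 (waves of 12, 12, and 6 tasks)
```

So with fewer partitions than total cores, some cores sit idle; with many more partitions, tasks queue up and run wave by wave.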

As for the number of partitions:

In the data-reading phase, e.g. sc.textFile, there are as many initial partitions as the input is divided into InputSplits.

The number of partitions remains the same during the map phase.

In the reduce phase, aggregating an RDD triggers a shuffle. The number of partitions of the resulting RDD depends on the specific operation: for example, repartition produces exactly the specified number of partitions, and many shuffle operators accept a partition-count argument.

2. Comparison of Spark deployment modes

The three deployment modes (Standalone, Mesos, and YARN) can be summarized as follows:

Mesos seems to be a good fit for Spark and was officially recommended.

But if you run Hadoop and Spark at the same time, YARN seems the better choice in terms of compatibility; after all, it is Hadoop-native, and Spark on YARN runs well.

If you run not only Hadoop and Spark but also Docker, Mesos seems more general-purpose as a resource manager.

Standalone seems more suitable for small-scale computing clusters.

As for the comparison between client and cluster in YARN mode:

Before digging into the difference between YARN-Client and YARN-Cluster, one concept must be clear: the ApplicationMaster. In YARN, each application instance has an ApplicationMaster process, which is the first container the application starts. It is responsible for negotiating with the ResourceManager to request resources and, after obtaining them, for telling the NodeManagers to start containers on its behalf. At heart, the difference between YARN-Cluster and YARN-Client mode is a difference in the role of the ApplicationMaster process.

In YARN-Cluster mode, the Driver runs inside the AM (ApplicationMaster), which requests resources from YARN and supervises the running job. After the user submits the job, the client can be shut down and the job keeps running on YARN, so YARN-Cluster mode is not suitable for interactive jobs.

In YARN-Client mode, the ApplicationMaster only asks YARN for Executors; the client communicates with the requested containers to schedule their work, which means the client cannot exit.

(1) In YARN-Cluster mode the Driver runs on one of the cluster's NodeManagers, while in YARN-Client mode it runs on the client machine that submits the job.

(2) The Driver communicates with the Executors, so in YARN-Cluster mode the client can be closed after submitting the application, but in YARN-Client mode it cannot.

(3) YARN-Cluster is suitable for production environments; YARN-Client is suitable for interaction and debugging.

3. The operating principle of Spark

A Spark application performs a series of transformations and finally triggers a job through an action. The flow is roughly as follows.

When the application is submitted, a SparkContext is constructed. Based on the dependency relationships between RDDs, it builds a DAG and submits the DAG to the DAGScheduler for parsing. The parsing works backwards through the lineage, using shuffle dependencies as boundaries, and produces stages together with the dependency relationships between them.

For each stage, the DAGScheduler submits the stage's tasks as a TaskSet to the TaskScheduler, which wraps each TaskSet in a TaskSetManager and dispatches the tasks to Executors. Each Executor runs its tasks in multiple threads.

When a task finishes, its completion is reported to the SchedulerBackend, which passes it to the TaskScheduler. The TaskScheduler notifies the TaskSetManager, which removes the finished task and launches the next one; the completed result is also inserted into a success queue, and the TaskScheduler sends the success notification to the TaskSetManager.

When all tasks of a TaskSet have finished, the TaskSetManager reports the results back to the DAGScheduler. If a task is a ResultTask, its result is handed to the JobListener; otherwise (a ShuffleMapTask) the result is saved for the following stage. Output data is written after all stages have run.
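The stage-cutting step above (split the lineage at shuffle boundaries) can be sketched as follows. This is a deliberately simplified model that uses a linear lineage instead of a full DAG, and the operator names are illustrative:

```python
# Operators marked "wide" require a shuffle; the DAGScheduler starts a new
# stage at every shuffle boundary. Here the lineage is a simple chain.
lineage = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),   # shuffle boundary -> new stage
    ("filter", "narrow"),
    ("repartition", "wide"),   # shuffle boundary -> new stage
    ("collect", "narrow"),
]

def split_into_stages(ops):
    """Cut a linear chain of operators into stages at wide dependencies."""
    stages, current = [], []
    for name, kind in ops:
        if kind == "wide" and current:
            stages.append(current)  # close the running stage before the shuffle
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages

print(split_into_stages(lineage))
# -> [['textFile', 'map'], ['reduceByKey', 'filter'], ['repartition', 'collect']]
```

Each resulting stage would then be turned into one TaskSet, with one task per partition, and the stages run in dependency order.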
