1. The core concepts of Spark

(1) Application
Represents a user application. It consists of one Driver Program and several Executors (this is the Spark code you write).
(2) Driver program
The Driver runs the main() function of the Application and creates the SparkContext. The SparkContext is created to prepare the environment in which the Spark application runs: it is responsible for communicating with the Cluster Manager, applying for resources, and assigning and monitoring tasks. The SparkContext is closed after the program finishes.
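A minimal sketch of such a driver program, assuming a local master ("local[2]") and a made-up application name; main() creates the SparkContext, runs one small job, and stops the context at the end:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver sketch: main() creates the SparkContext, runs one small job,
// and stops the context when the program finishes. The master URL "local[2]"
// and the application name are placeholder values for illustration.
object SimpleDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("simple-driver").setMaster("local[2]")
    val sc   = new SparkContext(conf)   // prepares the runtime environment and talks to the cluster manager
    try {
      val n = sc.parallelize(1 to 100).count()   // a trivial job so the example is complete
      println(s"count = $n")
    } finally {
      sc.stop()                          // close the SparkContext after execution
    }
  }
}
```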
(3) Cluster Manager
In Standalone mode this is the Master node, which controls the whole cluster and monitors the Workers; in YARN mode it is the ResourceManager.
(4) Spark Context
The context of the whole application. It controls the application's life cycle, schedules the computing resources, and coordinates the Executors on each Worker. During initialization it creates the two core scheduling components, DAGScheduler and TaskScheduler.
(5) RDD
The basic computing unit of Spark. A set of RDDs linked by their dependencies forms an executable directed acyclic graph, the RDD Graph.
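As a rough illustration (made-up data, local master), each transformation below returns a new RDD that remembers its parent; the chain of parent links is the RDD Graph, which toDebugString prints:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Each transformation returns a new RDD that remembers its parent; the chain of
// parent links is the RDD Graph. Data, app name and master are placeholders.
object RddLineage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-lineage").setMaster("local[2]"))
    val numbers = sc.parallelize(1 to 10)      // base RDD
    val doubled = numbers.map(_ * 2)           // child of numbers
    val evens   = doubled.filter(_ % 4 == 0)   // child of doubled
    println(evens.toDebugString)               // prints the lineage, i.e. the RDD Graph
    sc.stop()
  }
}
```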
(6) DAGScheduler
Splits the application into jobs, builds a DAG for each job, divides the DAG into stages, and finally submits the stages to the TaskScheduler.
(7) TaskScheduler
Splits each stage submitted by the DAGScheduler into a set of tasks, then submits that TaskSet to the Workers (the cluster) to run; it decides which Executor runs which Task.
(8) Worker
A node in the cluster that can run Application code. In Standalone mode it is a worker node configured through the slaves file; in Spark on YARN mode it is a NodeManager node. (That is, a node that runs Executors.)
(9) Executor
A process launched for an Application on a Worker node. It runs tasks and stores data in memory or on disk. In Spark on YARN mode the process is named CoarseGrainedExecutorBackend; each CoarseGrainedExecutorBackend process holds exactly one Executor object, which wraps each Task in a TaskRunner and takes an idle thread from its thread pool to run it. The number of Tasks a CoarseGrainedExecutorBackend can run in parallel therefore depends on the number of CPU cores assigned to it.
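The following is a deliberately simplified model of that idea, not Spark's actual classes: tasks are wrapped in Runnables (standing in for TaskRunner) and handed to a fixed-size thread pool whose size plays the role of the executor's assigned cores.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Deliberately simplified model, not Spark's real classes: each received task is
// wrapped in a Runnable (standing in for TaskRunner) and handed to a thread pool
// whose size plays the role of the cores assigned to the executor, so at most
// `cores` tasks run at the same time.
object ExecutorThreadPoolModel {
  def main(args: Array[String]): Unit = {
    val cores = 2                                   // assumed cores for this "executor"
    val pool  = Executors.newFixedThreadPool(cores)
    val tasks = (1 to 6).map { id =>
      new Runnable {
        def run(): Unit =
          println(s"running task $id on ${Thread.currentThread().getName}")
      }
    }
    tasks.foreach(t => pool.execute(t))             // at most `cores` tasks in parallel
    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.MINUTES)
  }
}
```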
(10) Stage
Each Job is split into groups of Tasks; each group of Tasks forms a TaskSet, which is called a Stage.
(11) Job
A parallel computation consisting of multiple Tasks, triggered by an Action; each triggered action corresponds to one job.
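A small sketch with made-up data: transformations are lazy and only actions submit jobs, so the two actions below would show up as two jobs in the Spark UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Transformations are lazy; only actions submit jobs. Two actions below, so the
// Spark UI would show two jobs. Data, app name and master are placeholders.
object JobsFromActions {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("jobs-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000).map(_ * 2)  // no job yet: map is a transformation
    val total  = rdd.reduce(_ + _)                  // action: triggers job 1
    val sample = rdd.take(5)                        // action: triggers job 2
    println(s"total=$total, sample=${sample.mkString(",")}")
    sc.stop()
  }
}
```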
(12) SparkEnv
A thread-level context that stores references to important runtime components. SparkEnv creates and holds references to components including the following:
MapOutputTracker: responsible for storing Shuffle meta-information.
BroadcastManager: responsible for controlling broadcast variables and storing their meta-information.
BlockManager: responsible for storage management, creating and finding blocks.
MetricsSystem: monitors runtime performance metrics.
SparkConf: responsible for storing configuration information.
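A minimal sketch of SparkConf as a key-value configuration store; the property keys are standard Spark settings, and the values are placeholders:

```scala
import org.apache.spark.SparkConf

// SparkConf is essentially a key-value store of configuration. The keys below are
// standard Spark properties; the values are placeholders.
object ConfExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("conf-demo")
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "2")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    println(conf.get("spark.executor.memory"))   // -> 2g
    println(conf.toDebugString)                  // dump all settings
  }
}
```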
2. The task running process of Spark

(1) Basic running process:
Step 1 (build the DAG): transformation operators are applied to RDDs, and an action operator finally triggers the job submission. After submission, Spark builds a directed acyclic graph (DAG) from the dependency relationships between the RDDs produced during the transformations.
Step 2 (cut the DAG): the DAG is cut at wide (shuffle) dependencies; whenever a wide dependency is encountered, a new scheduling stage (Stage) begins. Each Stage contains one or more Tasks, which form a TaskSet that is submitted to the underlying task scheduler to be scheduled and run. (See the sketch after these steps.)
Step 3 (schedule the tasks): each Spark task scheduler serves exactly one SparkContext instance. When the task scheduler receives a TaskSet, it distributes its Tasks to Executor processes on the Worker nodes for execution. If a task fails, the task scheduler is responsible for rescheduling it.
Step 4 (execute the tasks): when an Executor receives a task, it runs it on one of the threads of the thread pool that was initialized when the Executor started, with each thread responsible for one task. When a task finishes, the Executor chooses an appropriate way, according to the task type, to return the result to the task scheduler.
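The sketch below ties the four steps together on made-up input: the transformations build the DAG, reduceByKey introduces the wide (shuffle) dependency that cuts the DAG into two stages, and the collect action triggers the job whose tasks run on the executors and whose results return to the driver.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// flatMap and map are narrow dependencies and stay in one stage; reduceByKey is a
// wide (shuffle) dependency, so the DAG is cut into two stages there. collect is
// the action that triggers the job. Input data, app name and master are placeholders.
object WordCountStages {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("spark rdd", "spark dag", "rdd dag"))
      .flatMap(_.split(" "))          // narrow dependency: same stage
      .map(w => (w, 1))               // narrow dependency: same stage
      .reduceByKey(_ + _)             // wide dependency: a new stage starts here
    println(counts.toDebugString)     // the indented lineage shows the shuffle boundary
    counts.collect().foreach(println) // action: triggers the job; results return to the driver
    sc.stop()
  }
}
```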
(2) General running process:
- The SparkContext is initialized and registers with the resource manager, applying for resources to run Executors.
- The resource manager allocates Executor resources and starts the Executors, which report their status to the resource manager by heartbeat.
- The Executors reverse-register with the driver, telling it that resources are ready and tasks can be executed.
- The SparkContext builds the RDDs; the DAGScheduler splits the RDD graph into stages, packages them into TaskSets, and hands them to the TaskScheduler.
- The TaskScheduler sends the tasks to the Executors on the Workers, which execute them.
Addendum: what is the difference between client and cluster mode in Spark on YARN?
Spark on yarn-cluster
Spark on yarn-client
Note: while a Spark job runs, a large amount of data is usually exchanged between the driver and the cluster. In client mode the driver sits outside the cluster, so this exchange travels over the network for the whole run of the program, causing a surge in network traffic. In cluster mode the driver runs together with the ApplicationMaster inside the cluster, so the data exchange stays within the cluster and the pressure on the network is much smaller.

(3) Detailed running process:
The steps are as follows:
- Start the Spark cluster: the corresponding Master and Worker processes are started. The Master is the manager of the cluster; it knows how many slave nodes there are, what resources they have, and whether they are alive.
- Worker registration: when a Worker process starts, it sends a registration message to the Master process (both Worker and Master are event-driven models based on Akka actors). After registering successfully, the Worker keeps sending heartbeats to the Master, both to report its own status and to watch whether the Master node is still alive.
- Driver submits a job: when the driver submits a job to the Spark cluster, it submits it to the Master and registers the resources the Spark application needs. In plain words, it applies to the Master for the resources the application needs to run.
- Master allocates resources: when the driver submits a job request, the Master receives it and assigns the work to Worker nodes, that is, it asks the Worker nodes to start the corresponding Executor processes. Each Executor maintains a thread pool, and the threads in that pool are what actually execute the tasks.
- Worker starts Executors: when a Worker node receives the Master's request to start Executors, it launches one or more Executors and reports the successful start to the Master, indicating that it is ready to receive tasks.
- Reverse registration with the driver: once a Worker has started its Executors successfully, the Executors reverse-register with the driver, telling it which Executors can receive and run Spark tasks.
- Driver receives the registrations: after the driver receives the registration information, it initializes the corresponding executor_info structures; based on the executor IDs sent by the Workers, it knows which Executors serve it.
- Driver initializes the SparkContext: the SparkContext builds the RDDs, the DAGScheduler splits them into stages, and the stages are packaged into TaskSets and handed to the TaskScheduler.
- The TaskScheduler sends the tasks to the Executors on the Workers, which execute them.
- Executors execute the tasks: when an Executor process receives the tasks sent by the driver, it deserializes them, wraps each task in a TaskRunner thread, and puts it into the local thread pool to be scheduled and executed. When execution finishes, the result is landed (returned to the driver, printed out, saved locally, saved to HDFS, ...).
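As a closing sketch (made-up data, placeholder output path), here are two common ways a job's result is landed: collected back to the driver, or written out to a filesystem such as HDFS.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Two common ways the result of a job is "landed". The output path is a
// placeholder; data, app name and master are made up for illustration.
object ResultLanding {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("result-landing").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("a", "b", "a")).map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)               // results come back to the driver
    counts.saveAsTextFile("/tmp/word-counts-demo")  // or persisted to a filesystem such as HDFS
    sc.stop()
  }
}
```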