
Source Code Analysis of the Spark Task Running Process


The writing, submission, and execution of a Spark task breaks down into three parts:

① writing the program and submitting the task to the cluster

② initializing the SparkContext

③ triggering the runJob method in an action operator to execute the task

(1) Writing the program and submitting it to the cluster:

① write the Spark program code

② package it into a jar to run on the cluster

③ submit the task with the spark-submit command

When submitting a task, you need to specify --class, the program's entry point (a class with a main method).
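For reference, a minimal sketch of such an entry class might look like this (the object name, app name, and the use of args(0) as the input path are illustrative, not from the original article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical entry class: spark-submit --class com.example.WordCount ...
object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)   // part (2): SparkContext initialization happens here
    sc.textFile(args(0))              // input path passed on the command line
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()                      // part (3): an action, which triggers runJob
      .foreach(println)
    sc.stop()
  }
}
```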

1) spark-submit --class xxx

2) ${SPARK_HOME}/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"

3) org.apache.spark.launcher.Main

Inside SparkSubmit, the call chain then runs roughly as follows (identifiers restored to their source-code form; the condition on SparkApplication is filled in from the Spark source):

```scala
submit(appArgs, uninitLog)
  doRunMain()
    runMain(childArgs, childClasspath, sparkConf, childMainClass, args.verbose)
      // childMainClass: … .WordCount (the main class of your own code)
      mainClass = Utils.classForName(childMainClass)
      val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
        mainClass.getConstructor().newInstance().asInstanceOf[SparkApplication]
      } else {
        new JavaMainApplication(mainClass)
      }
      app.start(childArgs.toArray, sparkConf) // invokes mainClass via reflection
      // at this point, it is equivalent to calling the main method of our own task class:
      val mainMethod = klass.getMethod("main", new Array[String](0).getClass)
      mainMethod.invoke(null, args)
```

④ the code you wrote begins to execute.
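To make the reflection step concrete, here is a small self-contained sketch of the same pattern outside of Spark (TargetApp and its arguments are made up for illustration):

```scala
// Standalone sketch of the reflection pattern JavaMainApplication uses.
// TargetApp is a hypothetical stand-in for the user's main class.
object TargetApp {
  def main(args: Array[String]): Unit =
    println(s"TargetApp.main invoked with: ${args.mkString(", ")}")
}

object ReflectDemo {
  def main(args: Array[String]): Unit = {
    val klass = Class.forName("TargetApp")                    // like Utils.classForName(childMainClass)
    val mainMethod = klass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array("hello", "spark"))          // static method, so the receiver is null
  }
}
```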

(2) Initializing the SparkContext:

When your program reaches new SparkContext(), the elaborate initialization of the SparkContext begins.

Introduction to SparkContext: the SparkContext is the user's only entry point to a Spark cluster, and it is used to create RDDs, accumulators, and broadcast variables in the cluster. The SparkContext is also a vital object for the entire Spark application: it is the core of application scheduling (though not of resource scheduling). When a SparkContext is initialized, three critical objects are initialized along with it: DAGScheduler, TaskScheduler, and SchedulerBackend.
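For orientation, the relevant lines in the SparkContext constructor look roughly like this (simplified from the Spark 2.x source; exact field names may differ between versions):

```scala
// Simplified from SparkContext's constructor (Spark 2.x):
// creates the SchedulerBackend and TaskScheduler appropriate for the master URL
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)  // DAGScheduler holds a reference to the TaskScheduler
_taskScheduler.start()                  // in standalone mode, this registers the app with the master
```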

The construction process of SparkContext:

1) code executed on the Driver:

  initialize the TaskScheduler

  initialize the SchedulerBackend

  initialize the DAGScheduler

2) code executed on the Worker and Master:

  the Driver registers with the Master and applies for resources

  the Worker is responsible for starting Executors

(3) Triggering the runJob method in an action operator:

Summary of how a Spark task runs:

1) Package the written program into a jar and submit it with spark-submit, which runs the main method of SparkSubmit on the cluster. There, the main class is loaded by reflection and its main method is invoked, so the code we wrote starts executing.

2) When the code reaches new SparkContext(), the complex and elaborate initialization of the SparkContext object begins. During initialization, two particularly important objects are created: DAGScheduler and TaskScheduler. The role of the DAGScheduler is to cut the RDD dependency graph into stages, which are then submitted to the TaskScheduler as TaskSets.

3) When the TaskScheduler is built, two very important objects are created: DriverActor and ClientActor. DriverActor is responsible for receiving the reverse registration of executors and for sending tasks to executors to run; ClientActor is responsible for registering with the master and submitting the task.

4) When ClientActor starts, it packs the parameters of the user-submitted task into an ApplicationDescription object and submits it to the master to register the task.

5) When the master receives the registration request from ClientActor, it parses the request parameters, encapsulates them into an Application, persists it, and adds it to the task queue waitingApps. When it is this task's turn, the master executes schedule() to allocate resources for it.

6) When a worker receives the launchExecutor message from the master, it unpacks the information, encapsulates it into an ExecutorRunner, and calls that object's start method. After the executor starts, it reverse-registers with the driver, and the driver sends a registration-success message back. Once the executor receives the registration-success message from the DriverActor, it creates a thread pool for executing the tasks the DriverActor sends.

7) When all executors belonging to the task have started and reverse-registered, the environment for running the task is ready. The driver then finishes initializing the SparkContext object, which means the new SparkContext() line has completed.

8) After SparkContext initialization completes, our code continues to run until it reaches an action operator, which triggers a job. The driver submits the job to the DAGScheduler. Starting from the last operator, the DAGScheduler divides the DAG into stages according to dependencies, encapsulates each stage into a TaskSet, and submits the tasks in the TaskSet to the TaskScheduler.

9) The TaskScheduler receives the tasks sent by the DAGScheduler, obtains a serializer, serializes each task, wraps the serialized task into a LaunchTask message, and sends the LaunchTask to the designated executor.

10) When the executor receives the LaunchTask sent by the DriverActor, it obtains a deserializer, deserializes the task, encapsulates it into a TaskRunner, and then takes a thread from the executor's thread pool to apply the operator in the deserialized task to the corresponding partition of the RDD.

11) Finally, when all tasks have completed, the entire application has finished executing, and the SparkContext object is closed.
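As a reference for step 8), this is roughly what an action looks like in the RDD source: collect() hands the job to SparkContext.runJob, which forwards it to the DAGScheduler (simplified from the Spark 2.x source):

```scala
// Simplified from org.apache.spark.rdd.RDD (Spark 2.x):
def collect(): Array[T] = withScope {
  // runJob ships the function to every partition and gathers the results
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

// ...and SparkContext.runJob ultimately calls:
// dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
```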
