Spark (2): spark architecture and physical implementation diagram

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/03 Report--

The figure above is a flow chart of job submission. The specific steps of job submission are as follows:

Once an action is triggered, DAGScheduler.runJob is called to submit the job against the logical execution graph (the DAG). The steps are:

1. runJob calls finalStage = newStage() to divide the graph into stages. newStage() in turn calls getParentStages() on the final RDD.

2. getParentStages() starts from finalRDD and visits the logical execution graph in reverse. When it encounters a NarrowDependency, it adds the dependent RDD to the current stage; when it encounters a ShuffleDependency, it cuts a new stage and recurses into the stages that the ShuffleDependency depends on.

3. Whenever a ShuffleMapStage (any stage other than the one that produces the final result) is formed, the last RDD of that stage is registered via MapOutputTrackerMaster.registerShuffle(shuffleDep.shuffleId, rdd.partitions.size). This is important because the shuffle phase later asks MapOutputTrackerMaster for the locations of the ShuffleMapTask output data.

4. submitStage(finalStage) then determines the stage's missing parents with getMissingParentStages(stage); if the parent stages have already executed, the result is empty.

5. If missingParentStages is not empty, the missing parent stages are submitted recursively and the current stage is added to waitingStages. When the parent stages finish executing, submission of the stages in waitingStages is triggered.

6. If missingParentStages is empty, the stage can be executed immediately: submitMissingTasks(stage, jobId) generates and submits the concrete tasks. If the stage is a ShuffleMapStage, one ShuffleMapTask is created per partition of the stage's last RDD; if it is a ResultStage, one ResultTask is created per partition of the stage's last RDD.

7. The tasks of a stage form a TaskSet, and taskScheduler.submitTasks(taskSet) finally submits the whole TaskSet. TaskScheduler hands the tasks to the DriverActor, which serializes them and sends them to the executors for actual execution.
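The stage-cutting walk described above can be sketched in a few lines. This is a toy illustration, not Spark source: the `RDD`, `build_stages`, and dependency-kind names are invented for this example; only the rule itself (narrow dependencies stay in the current stage, shuffle dependencies cut a new parent stage, recursively) mirrors what getParentStages() does.

```python
# Toy sketch (not real Spark APIs): divide an RDD lineage into stages by
# walking backwards from the final RDD and cutting at shuffle dependencies.

class RDD:
    def __init__(self, name, deps=None):
        self.name = name
        # each dep is a pair: ("narrow" | "shuffle", parent_rdd)
        self.deps = deps or []

def build_stages(final_rdd):
    """Return a list of stages; each stage is a list of RDD names.

    Narrow dependencies stay inside the current stage; a shuffle
    dependency cuts a new (parent) stage and recurses into it.
    """
    stages = []

    def visit(rdd, stage):
        stage.append(rdd.name)
        for kind, parent in rdd.deps:
            if kind == "narrow":
                visit(parent, stage)           # same stage
            else:
                parent_stage = []
                visit(parent, parent_stage)    # cut: new parent stage
                stages.append(parent_stage)

    final_stage = []
    visit(final_rdd, final_stage)
    stages.append(final_stage)                 # result stage comes last
    return stages

# lineage: textFile -> map (narrow) -> reduceByKey (shuffle)
text = RDD("textFile")
mapped = RDD("map", [("narrow", text)])
reduced = RDD("reduceByKey", [("shuffle", mapped)])

print(build_stages(reduced))
# -> [['map', 'textFile'], ['reduceByKey']]
```

The shuffle boundary produces exactly two stages here: a ShuffleMapStage covering the narrow map chain, and the ResultStage containing only the final RDD.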

The figure above shows the task execution process, which proceeds as follows:

1. After the Worker receives the tasks, the Executor wraps each task as a TaskRunner and pulls an idle thread from its thread pool to run it.

2. On receiving a serialized task, the Executor first deserializes it, then runs it to obtain the execution result directResult, which is sent back to the driver. If the result is relatively large (such as the result of groupByKey), it is first stored in local "memory + disk", managed by the BlockManager, and only the storage location information (indirectResult) is sent to the driver.

3. A ShuffleMapTask produces a MapStatus. A MapStatus contains two things: the BlockManagerId of the BlockManager where the task ran (effectively executorId plus host, port, and nettyPort), and the size of each FileSegment output by the task. The result of a ResultTask is the result of executing func on its partition; for example, the func of count() counts the number of records in the partition.

4. After receiving a task's execution result, the driver performs a series of operations:

a. It first tells taskScheduler that the task has finished, and then analyzes the result.

b. If the result came from a ResultTask, a ResultHandler computes the driver-side result (for example, count() sums the results of all ResultTasks).

c. If the result is the MapStatus of a ShuffleMapTask, the MapStatus (the location and size information of the FileSegments output by the ShuffleMapTask) is stored in the mapStatuses structure in mapOutputTrackerMaster, so that reducers can query it later during shuffle.

d. If the received task is the last task in its stage, the next stage can be submitted; if that stage was already the last stage, dagScheduler is told that the job has completed.
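The driver-side branching on task results can be sketched as a toy example. This is an illustration of the two paths described above, not Spark source: the dictionaries, the task records, and handle_task_result are invented names; a real MapStatus also carries more structure than the (location, sizes) pair used here.

```python
# Toy sketch (not real Spark APIs): the driver either folds a ResultTask's
# output with a result handler, or records a ShuffleMapTask's MapStatus so
# reducers can later locate and size the map output FileSegments.

map_statuses = {}      # shuffle_id -> {map_partition: (block_manager_id, segment_sizes)}
driver_results = []    # accumulated ResultTask outputs

def handle_task_result(task, result):
    if task["kind"] == "result":
        # e.g. for count(), each result is a per-partition record count
        driver_results.append(result)
    else:  # "shuffle_map": result is a MapStatus-like (location, sizes) pair
        block_manager_id, segment_sizes = result
        map_statuses.setdefault(task["shuffle_id"], {})[task["partition"]] = (
            block_manager_id, segment_sizes)

# two ShuffleMapTasks report where their FileSegments live and how big they are
handle_task_result({"kind": "shuffle_map", "shuffle_id": 0, "partition": 0},
                   ("exec-1:7337", [128, 64]))
handle_task_result({"kind": "shuffle_map", "shuffle_id": 0, "partition": 1},
                   ("exec-2:7337", [32, 256]))

# two ResultTasks of a count(): the driver sums the per-partition counts
handle_task_result({"kind": "result"}, 10)
handle_task_result({"kind": "result"}, 7)

print(sum(driver_results))        # 17
print(map_statuses[0][1][0])      # exec-2:7337
```

The key asymmetry matches the text: ResultTask outputs are consumed immediately on the driver, while MapStatus entries are merely indexed for later shuffle reads.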




© 2024 shulou.com SLNews company. All rights reserved.
