This article offers an analysis of tasks in Spark. It should serve as a useful reference for interested readers; I hope you gain a lot from it.
The Task is the smallest execution unit in Spark: all Spark work is ultimately carried out by executing Tasks. The task system is one of Spark's core modules, and once its execution mechanism is understood thoroughly, Spark as a whole becomes much clearer. Task is the abstract base class of the task system, and it has two subclasses, ResultTask and ShuffleMapTask, which together form the core of the task system.
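Below is a minimal, self-contained sketch of this hierarchy. Every name ending in "Sketch" is hypothetical; the real classes live in org.apache.spark.scheduler and carry far more state.

// Hypothetical stand-ins for the hierarchy described above.
trait TaskContextSketch { def partitionId: Int }

abstract class TaskSketch[T](val stageId: Int, val partitionId: Int) {
  // The executor calls run(), which delegates to the subclass's runTask().
  final def run(context: TaskContextSketch): T = runTask(context)
  def runTask(context: TaskContextSketch): T
}

// ResultTask-like: applies a user function to one partition's data and returns the result.
class ResultTaskSketch[T, U](stageId: Int, partitionId: Int,
                             data: Iterator[T],
                             func: (TaskContextSketch, Iterator[T]) => U)
    extends TaskSketch[U](stageId, partitionId) {
  override def runTask(context: TaskContextSketch): U = func(context, data)
}

// ShuffleMapTask-like: writes one partition's output for the next stage and returns
// a status describing where it was written (a String stands in for MapStatus).
class ShuffleMapTaskSketch[T](stageId: Int, partitionId: Int,
                              data: Iterator[T],
                              writer: Iterator[T] => String)
    extends TaskSketch[String](stageId, partitionId) {
  override def runTask(context: TaskContextSketch): String = writer(data)
}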
ResultTask is easy to understand: it directly executes the data operation on one partition of an RDD inside the Task. Recall the structure of an RDD described earlier: it has a compute function, and the task's job is to execute that compute function.
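For reference, the relevant part of the RDD contract looks roughly like this (abridged from org.apache.spark.rdd.RDD; the real iterator() also consults the cache and checkpoint data before falling back to compute()):

import org.apache.spark.{Partition, TaskContext}

// Abridged sketch of the RDD contract; not the real class.
abstract class RDD[T] {
  // Subclasses implement compute: how to produce one partition's data.
  def compute(split: Partition, context: TaskContext): Iterator[T]

  // A Task calls iterator(); abridged here to go straight to compute().
  final def iterator(split: Partition, context: TaskContext): Iterator[T] =
    compute(split, context)
}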
ShuffleMapTask also performs the data operation on one partition of an RDD inside the Task; the difference is how the output is stored. ShuffleMapTask saves the results of its data operation to global storage managed by BlockManager, and those results can then be used as input data by the tasks of the next stage. Why are there two kinds? Put plainly, ResultTask corresponds to narrow-dependency RDD operations, while ShuffleMapTask corresponds to wide-dependency operations (such as a full join). ShuffleMapTask needs special handling for reading and writing data: it uses BlockManager to output its dataset, and the child RDD of a ShuffleMapTask likewise reads its input from BlockManager.
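To make the distinction concrete, consider a small hypothetical job (sc, a SparkContext, and the file input.txt are assumptions):

// Hypothetical word-count job; `sc` and "input.txt" are assumed.
val words  = sc.textFile("input.txt").flatMap(_.split(" "))  // narrow dependencies
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)       // wide dependency: shuffle
counts.collect()
// The stage before the shuffle runs as ShuffleMapTasks, writing shuffle output
// via BlockManager; the final stage runs as ResultTasks that build collect()'s array.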
The code of the ResultTask and ShuffleMapTask classes is very simple: each just overrides the runTask method.
A Task is deserialized from its task description object; after obtaining the RDD, the partition, and related objects, it creates a TaskContextImpl as the task context and then executes its run method, which reads the RDD's data through an iterator and processes it. The run method does its real work by calling runTask, the method each subclass overrides; both ResultTask and ShuffleMapTask provide their own runTask.
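Continuing the hypothetical sketch from above, the executor-side flow is roughly: build a context, then call run, which dispatches to the overridden runTask:

// Hypothetical executor-side flow for the sketch classes defined earlier.
// In Spark this context would be a TaskContextImpl carrying metrics, attempt ids, etc.
def executeTask[T](task: TaskSketch[T]): T = {
  val context = new TaskContextSketch { val partitionId: Int = task.partitionId }
  task.run(context) // dispatches to the ResultTask- or ShuffleMapTask-style runTask
}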
1. ResultTask
This type of task directly produces the final result: once it finishes, its data does not need to be processed by any further task. In other words, it is a terminal task.
ResultTask overrides the runTask method; its core code is as follows:
override def runTask(context: TaskContext): U = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, func) = ser.deserialize[(RDD[T], (TaskContext, Iterator[T]) => U)](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  func(context, rdd.iterator(partition, context))
}
Deserialization yields the data processing function func defined on the RDD; func conforms to the signature:
(TaskContext, Iterator[T]) => U
Then execute:
func(context, rdd.iterator(partition, context))
This call hands the data iterator of the RDD partition to func, which consumes it one record at a time. The part ResultTask overrides is that simple.
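For instance, the func behind a collect() action is essentially a closure that drains the iterator into an array (illustrative only; the real closure is built inside RDD.collect and wrapped by SparkContext.runJob):

import org.apache.spark.TaskContext

// Illustrative shape of the func that a collect() job would ship to a ResultTask.
val collectFunc: (TaskContext, Iterator[String]) => Array[String] =
  (_, iter) => iter.toArray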
2. ShuffleMapTask
This is the shuffle-map kind of task: its execution results are consumed by the next RDD, so its output data must be written out to the shuffle area. (The shuffle area is described in detail in the article on partition data management.)
ShuffleMapTask likewise overrides the runTask method; its core code is as follows:
override def runTask(context: TaskContext): MapStatus = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val rddAndDep = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  val rdd = rddAndDep._1
  val dep = rddAndDep._2
  dep.shuffleWriterProcessor.write(rdd, dep, partitionId, context, partition)
}
The first half is similar to ResultTask: deserialize to get the RDD and the partition, as well as the shuffle dependency dep. The RDD's data is then iterated and written out to the shuffle region described by dep.
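Conceptually, the write step buckets each record by its target reduce partition before persisting the buckets as shuffle blocks. Here is a toy sketch of that bucketing (a hypothetical helper; the real path goes through ShuffleWriteProcessor, a ShuffleWriter, and BlockManager):

// Toy bucketing sketch, mirroring the spirit of HashPartitioner.getPartition.
def bucketForShuffle[K, V](records: Iterator[(K, V)],
                           numReducers: Int): Map[Int, Seq[(K, V)]] =
  records.toSeq.groupBy { case (k, _) =>
    ((k.hashCode % numReducers) + numReducers) % numReducers // non-negative modulo
  }

In Spark itself the buckets are serialized into shuffle block files via BlockManager, and runTask returns a MapStatus that tells the scheduler where each reducer can fetch its input.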
Thank you for reading this article carefully. I hope this analysis of tasks in Spark proves helpful to you.