
Spark Partition Parallelism Determination Mechanism


This article focuses on the Spark partition parallelism determination mechanism; interested readers may wish to take a look. The approach introduced here is simple, fast, and practical. Let's walk through it together!

We all know that the smallest execution unit in a Spark job is the task, and setting a reasonable number of tasks per stage is one of the most important factors in job performance. Spark's own ability to pick the best degree of parallelism is limited, so on the premise of understanding the internal mechanism, we need to test and calculate our way to the best parameter values.

When a Spark job executes, the RDD lineage is divided into stages, and the number of tasks in a stage equals the number of partitions of the last RDD in that stage. As mentioned earlier, the key to stage division is wide dependencies, which are usually accompanied by shuffle operations. For a stage that receives input from another stage, the operator usually takes a numPartitions parameter to explicitly specify the number of partitions. The most typical are the ByKey operators, such as groupByKey(numPartitions: Int), but the right partition count needs to be determined through repeated testing. First determine the number of partitions of the parent RDD (obtainable via rdd.partitions.size), then increase the count from that baseline, debugging several times until the job runs smoothly and safely within the allocated resources.
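A minimal sketch of this tuning loop (assuming an existing SparkContext named sc; the path and the value 48 are placeholders you would tune through testing):

val pairs = sc.textFile("hdfs:///data/events.csv")
  .map(line => (line.split(",")(0), 1))
println(pairs.partitions.size)       // check the parent RDD's partition count first
val grouped = pairs.groupByKey(48)   // explicitly pass numPartitions, then increase and re-test
println(grouped.partitions.size)     // 48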

For an RDD without a parent RDD, such as one generated by loading data from HDFS, the number of partitions is determined by the InputFormat split mechanism. Usually one HDFS block corresponds to one partition, and an unsplittable file corresponds to a single partition.
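For example (a sketch; the paths and block layout are hypothetical):

// One HDFS block normally maps to one input split, hence one partition
val logs = sc.textFile("hdfs:///logs/access.log")
println(logs.partitions.size)   // roughly fileSize / blockSize, e.g. 8 for 1 GB in 128 MB blocks
// A gzip-compressed file is not splittable, so it loads as a single partition
val gz = sc.textFile("hdfs:///logs/archive.log.gz")
println(gz.partitions.size)     // 1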

For RDDs generated through SparkContext's parallelize or makeRDD methods, the number of partitions can be specified directly in the method call. If it is not specified, the spark.default.parallelism configuration is consulted. The following is the source code that determines defaultParallelism by default:

override def defaultParallelism(): Int = {
  conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))
}
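As a quick check of both paths (a sketch, assuming an existing SparkContext sc):

val a = sc.parallelize(1 to 100, 4)   // explicit numSlices wins
println(a.partitions.size)            // 4
val b = sc.parallelize(1 to 100)      // falls back to spark.default.parallelism,
println(b.partitions.size)            // i.e. max(total executor cores, 2) if the config is unset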

Typically, an RDD has the same number of partitions as the RDD it depends on, except across a shuffle. But there are several special operators:

1. coalesce and repartition operators

The original article first shows two source code diagrams of the coalesce operator, in RDD and in Dataset respectively. (Dataset is the distributed dataset of Spark SQL; it will be discussed in more detail in a later article on Spark SQL.)

Analyzing the coalesce source code shows that, in both RDD and Dataset, coalesce does not trigger a shuffle by default, and the number of partitions of an RDD created through coalesce is less than or equal to that of its parent RDD.
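A minimal sketch of that behavior (assuming sc exists):

val src = sc.parallelize(1 to 100, 8)
println(src.coalesce(2).partitions.size)    // 2: narrowing needs no shuffle
println(src.coalesce(16).partitions.size)   // still 8: widening is silently ignored without a shuffle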

The source code of the repartition operator is not reproduced here; its analysis is also fairly simple, and some hints are given in the figure. In practice, repartition is the better choice in the following two cases:

1) Increasing the number of partitions. repartition triggers a shuffle, and a shuffle can increase the partition count.

coalesce does not trigger a shuffle by default, so even if it is called to increase the number of partitions, the actual count remains the current number of partitions.

2) Drastically reducing the number of partitions, for example to 1. If coalesce is used to shrink to 1 partition, the parallelism of the upstream stage drops with it, hurting performance. This is where the advantage of repartition shows: it leaves the parallelism of the original stage unchanged, which matters more as data volume grows. Note, however, that because repartition triggers a shuffle, you must weigh the cost of that shuffle against the benefit of the retained parallelism; both cases are illustrated in the sketch below.
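A short sketch of both cases (assuming sc exists; expensiveTransform is a hypothetical stand-in for real per-record work):

def expensiveTransform(i: Int): Int = i * 2   // hypothetical per-record work
val mapped = sc.parallelize(1 to 1000000, 100).map(expensiveTransform)
// Case 1: repartition shuffles, so it can genuinely increase the partition count
println(mapped.repartition(200).partitions.size)   // 200
// Case 2: coalesce(1) folds the map into a single-partition stage,
// so the expensive work itself runs with parallelism 1...
val viaCoalesce = mapped.coalesce(1)
// ...while repartition(1) inserts a shuffle boundary: the map still runs
// across 100 partitions and only the shuffled output has 1
val viaRepartition = mapped.repartition(1)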

2. union operator

Again, look directly at the source code:

Analyzing the source code, when an RDD calls the union operator, the partition count of the resulting RDD falls into two cases: 1) every RDD in the union has a defined partitioner, and the partitioners are the same.

When multiple parent RDDs share the same partitioner, the RDD generated by union keeps that partitioner and the same partition count as the parents. For example, if n RDDs all have the same defined partitioner and m partitions each, the RDD produced by the union of those n RDDs still has m partitions, with the same partitioner.

2) If the first case is not satisfied, the partition count of the RDD generated through union is the sum of the parents' partition counts.
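A sketch of the two cases (assuming sc exists):

import org.apache.spark.HashPartitioner
// Case 1: both parents share the same partitioner, m = 4
val p = new HashPartitioner(4)
val x = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(p)
val y = sc.parallelize(Seq((3, "c"), (4, "d"))).partitionBy(p)
println(x.union(y).partitions.size)   // 4, and the partitioner is preserved
// Case 2: no common partitioner, so the counts simply add up
val u = sc.parallelize(1 to 10, 3)
val v = sc.parallelize(1 to 10, 5)
println(u.union(v).partitions.size)   // 8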

3. cartesian operator

Given the above analysis of the coalesce, repartition, and union operators and their source code, the cartesian source is easy to follow: the number of partitions of the RDD produced by cartesian is the product of its parents' partition counts.
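For example (a sketch, assuming sc exists):

val left  = sc.parallelize(1 to 10, 3)
val right = sc.parallelize(1 to 10, 4)
// cartesian pairs every partition of left with every partition of right
println(left.cartesian(right).partitions.size)   // 3 * 4 = 12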

In Spark SQL, the relevant task parallelism parameter is spark.sql.shuffle.partitions. The original article includes a figure here; Spark SQL will be discussed in more detail later:
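A sketch of setting it (assuming an existing SparkSession named spark; ordersDF and usersDF are hypothetical DataFrames):

// spark.sql.shuffle.partitions controls the post-shuffle partition count
// in Spark SQL (joins, aggregations); it defaults to 200
spark.conf.set("spark.sql.shuffle.partitions", "64")
val joined = ordersDF.join(usersDF, "user_id")
println(joined.rdd.getNumPartitions)   // 64 after the shuffle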

In Spark streaming computation, Spark Streaming is usually integrated with Kafka. Here there are two cases:

1. With the Receiver approach, each micro-batch RDD is a BlockRDD, and its partition count equals the number of blocks.

2. With the Direct approach, each micro-batch RDD is a KafkaRDD, and its partitions correspond one-to-one with the Kafka topic's partitions (see the sketch below).
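A sketch of the Direct approach using the spark-streaming-kafka-0-10 API (ssc is an existing StreamingContext; the broker address, group id, and topic name are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",   // placeholder address
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))
// Each micro-batch RDD is a KafkaRDD whose partitions map one-to-one
// to the Kafka partitions of the subscribed topic
stream.foreachRDD(rdd => println(rdd.getNumPartitions))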

At this point, you should have a deeper understanding of the Spark partition parallelism determination mechanism; why not try it out in practice! Follow us to keep learning more related content.
