Understanding Spark 3.0's Dynamic Partition Pruning Optimization
Spark 3.0 brings many features to look forward to, and dynamic partition pruning is one of them. This article walks through what dynamic partition pruning is, with the help of a few diagrams.
Static partition pruning in Spark
Before introducing dynamic partition pruning, it helps to review static partition pruning in Spark. In standard database terminology, pruning means that the optimizer avoids reading files that cannot contain the data we are looking for. For example, consider the following query:
SELECT * FROM iteblog.Students WHERE subject = 'English'
In this simple query, we want to find the records in the Students table where subject = 'English'. It would be wasteful to scan all of the data first and only then filter on subject = 'English', as shown in the following figure:
A better implementation is for the query optimizer to push the filter down to the data source, so that it can avoid scanning the entire dataset. This is what Spark does, as shown in the following figure:
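You can see the pushdown by printing the query plan. The following is a minimal sketch written for spark-shell (where spark is the predefined SparkSession), assuming a table iteblog.Students exists in the catalog; the exact plan text varies between Spark versions.
// The optimizer pushes the 'subject' predicate into the scan, which shows
// up as pushed filters on the data source in the printed plan.
spark.sql("SELECT * FROM iteblog.Students WHERE subject = 'English'").explain(true)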
Static partition pruning applies the same push-down idea to a partitioned table: if the query has a filter on a partition column, many unnecessary partitions can be skipped entirely, which greatly reduces the amount of data scanned and the disk I/O, and therefore improves query performance.
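As a rough sketch of static partition pruning (again for spark-shell, using an illustrative table name iteblog.students_p that is assumed not to clash with anything in your catalog):
// Create a partitioned table; 'subject' is the partition column.
spark.sql("""
  CREATE TABLE IF NOT EXISTS iteblog.students_p (name STRING, score INT, subject STRING)
  USING parquet
  PARTITIONED BY (subject)
""")
// Because the filter is a literal predicate on the partition column,
// Spark only lists and reads the subject=English partition.
spark.sql("SELECT * FROM iteblog.students_p WHERE subject = 'English'").explain(true)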
In practice, however, queries are rarely that simple. We usually have several dimension tables that need to be joined with a large fact table. In that case static partition pruning no longer applies, because the filter condition sits on one side of the join while the table that could be pruned sits on the other side. For example, consider the following query:
SELECT * FROM iteblog.Students JOIN iteblog.DailyRoutine WHERE iteblog.DailyRoutine.subject = 'English'
For the query above, a naive query engine produces the following execution plan:
It joins the two tables first and only then filters the result; with a large amount of data, the cost is easy to imagine. A better engine can optimize this, for example:
Such an engine filters out the useless rows of the dimension table before the join, which is naturally more efficient than the previous plan. But we can go one step further and also push a subject = 'English' filter into the iteblog.Students table, and that is exactly the dynamic partition pruning optimization that Spark 3.0 brings us.
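The difference can be sketched with two equivalent queries (for spark-shell, assuming for illustration that the two example tables join on a subject column, which the original query leaves implicit):
// Naive form: join everything first, then filter the joined result.
spark.sql("""
  SELECT * FROM iteblog.Students s
  JOIN iteblog.DailyRoutine d ON s.subject = d.subject
  WHERE d.subject = 'English'
""").explain(true)
// Better form: filter the small dimension table before the join. Dynamic
// partition pruning goes further and also skips the fact-table partitions
// that cannot match.
spark.sql("""
  SELECT * FROM iteblog.Students s
  JOIN (SELECT * FROM iteblog.DailyRoutine WHERE subject = 'English') d
    ON s.subject = d.subject
""").explain(true)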
Dynamic partition pruning
In Spark SQL, users typically submit queries in their preferred programming language through their preferred API, which is why we have both DataFrames and Datasets. Spark first converts the query into a form it can reason about, called the logical plan. At this stage, Spark optimizes the logical plan by applying a set of rule-based transformations such as column pruning, constant folding, and predicate pushdown. It then moves on to the physical plan of the query. During the physical planning phase, Spark generates an executable plan that distributes the computation across the cluster. Below, I will first explain how dynamic partition pruning is implemented during the logical planning phase, and then how it is further optimized during the physical planning phase.
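To make these phases concrete, Spark exposes the intermediate plans through queryExecution; a minimal spark-shell sketch:
import spark.implicits._
val df = Seq(("English", 1), ("Math", 2)).toDF("subject", "score").filter($"subject" === "English")
println(df.queryExecution.logical)        // logical plan as parsed
println(df.queryExecution.optimizedPlan)  // after rule-based optimization
println(df.queryExecution.executedPlan)   // final physical plan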
Logical planning phase optimization
Suppose we have a fact table with multiple partitions; for illustration, different colors represent different partitions. We also have a relatively small dimension table that is not partitioned. We then perform a typical scan over these datasets. In our example, suppose we read only two rows from the dimension table, and those rows correspond to two partitions of the fact table. So when the join is finally executed, only those two partitions of the fact table need to be read.
Therefore, we do not need to scan the entire fact table. A simple way to achieve this optimization is to build a filter subquery over the dimension table (such as SELECT subject FROM iteblog.DailyRoutine WHERE subject = 'English' in the example above) and apply that subquery as a filter before scanning the fact table.
In this way, we already know during the logical planning phase which partitions of the fact table need to be scanned.
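A hand-written version of that rewrite might look as follows (spark-shell, with illustrative table names iteblog.fact_p, partitioned on subject, and iteblog.dim):
// The fact-table scan is guarded by a subquery over the dimension table,
// so only the matching partitions of fact_p need to be listed and read.
spark.sql("""
  SELECT * FROM iteblog.fact_p f
  WHERE f.subject IN (SELECT subject FROM iteblog.dim WHERE subject = 'English')
""").explain(true)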
However, the plan above is still relatively inefficient, because it contains a duplicated subquery that needs to be eliminated. To do this, Spark applies further optimization during the physical planning phase.
Physical planning phase optimization
If the dimension table is small, Spark will most likely execute the join as a broadcast hash join. A broadcast hash join ships the data of the small table to every executor. The broadcast itself is no different from broadcasting data ourselves: the small table is first collected from the executors to the driver with the collect operator, and the driver then calls sparkContext.broadcast to send it to all executors. The broadcast data is built into a hash table (the build relation), and on each executor it is joined with the corresponding partitions of the large table. This join strategy avoids any shuffle. The details are shown below:
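A minimal sketch of a broadcast hash join, using the broadcast() hint so that the small dimension table becomes the build side (spark-shell, same illustrative table names as above):
import org.apache.spark.sql.functions.broadcast
val fact = spark.table("iteblog.fact_p")
val dim  = spark.table("iteblog.dim").filter("subject = 'English'")
// The physical plan should show BroadcastHashJoin rather than a
// shuffle-based join strategy.
fact.join(broadcast(dim), Seq("subject")).explain(true)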
Given how broadcast hash join works, the dynamic partition pruning optimization takes the broadcast result of the dimension table produced while building the build relation and uses it to filter the fact table dynamically, before the scan, so that useless data is never read. The details are shown below:
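One way to check whether the optimization actually kicked in is to look at the executed plan: when dynamic partition pruning applies, the fact-table scan typically carries a dynamic pruning expression in its partition filters, although the exact wording varies across Spark versions. A sketch with the same illustrative tables:
val plan = spark.sql("""
  SELECT * FROM iteblog.fact_p f
  JOIN iteblog.dim d ON f.subject = d.subject
  WHERE d.subject = 'English'
""").queryExecution.executedPlan.toString
// Look for the dynamic pruning marker on the fact-table scan.
println(plan.toLowerCase.contains("dynamicpruning"))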
That covers the dynamic partition pruning optimization in both the logical and the physical plan.
Applicable conditions for dynamic partition pruning
Dynamic partition pruning is not enabled for every query; the following conditions must be met:
The spark.sql.optimizer.dynamicPartitionPruning.enabled parameter must be set to true (it is enabled by default).
The table to be pruned must be a partitioned table, and the partition column must appear in the join's ON condition.
The join type must be INNER, LEFT SEMI (with the partitioned table on the left), LEFT OUTER (with the partitioned table on the right), or RIGHT OUTER (with the partitioned table on the left).
Even when the conditions above are met, dynamic partition pruning is not necessarily triggered: Spark also uses the spark.sql.optimizer.dynamicPartitionPruning.useStats and spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio parameters to estimate whether pruning is actually beneficial, and only prunes when the estimate says it is; see the sketch below for how to inspect these settings.
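These settings can be inspected or changed per session; a minimal spark-shell sketch (the keys below are Spark 3.0 configuration entries, and the values printed depend on your environment):
// Enabled by default in Spark 3.0.
println(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.enabled"))
// Whether to use table statistics when estimating if pruning pays off, and
// the fallback filter ratio used when statistics are not available.
println(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.useStats"))
println(spark.conf.get("spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio"))
// Pruning can be switched off for the current session if needed.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")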
That is all on the dynamic partition pruning optimization in Spark 3.0; I hope the content above helps you understand the feature and learn a bit more.