This article covers the dynamic partition pruning mechanism introduced in Spark 3.0, which can greatly improve application performance, especially in BI-style workloads with large numbers of WHERE-clause filters.
Dynamic partition pruning is a bit more involved than predicate pushdown: it takes the filter conditions on the dimension table, builds a filter set from them, and uses that set to prune the fact table before the join, reducing the data that participates in the join. Of course, it is even better if the data source can push the filter down directly, which requires something like indexes or precomputation.
1. Static partition predicate pushdown

Take the following SQL as an example:

SELECT * FROM Sales WHERE day_of_week = 'Mon'

There are two ways this statement could be executed:
1) Scan the whole table, then filter the rows.
2) Filter first, then scan only the matching data.
If the table is partitioned by the day_of_week field, the filter should be pushed down so that only the 'Mon' partition is scanned. This is what predicate pushdown looks like in a traditional database that has indexes and precomputation.
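As an aside not in the original text, here is a minimal Scala sketch of static partition pruning; the sample data, the /tmp/sales_demo path, and the exact plan wording are illustrative assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("static-partition-pruning-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Tiny illustrative Sales dataset, written out partitioned by day_of_week.
Seq((1, 100.0, "Mon"), (2, 250.0, "Tue"), (3, 75.0, "Mon"))
  .toDF("id", "amount", "day_of_week")
  .write.mode("overwrite")
  .partitionBy("day_of_week")
  .parquet("/tmp/sales_demo")

// Filtering on the partition column lets the scan skip whole directories.
// In the printed plan, look for the condition under PartitionFilters on the file scan
// (wording varies across Spark versions) rather than a Filter applied after reading all rows.
spark.read.parquet("/tmp/sales_demo")
  .where($"day_of_week" === "Mon")
  .explain()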
2. Dynamic partition pruning scenario

Partition pruning in Spark 3.0 builds on predicate pushdown: a filter is generated dynamically from one side of a join and applied to the other. In a join between a partitioned fact table and a dimension table, if there is a filter on the dimension table, Spark adds a dynamic-partition-pruning filter so that only the matching partitions of the fact table are read. Here is a simple SQL statement that joins the fact table (Sales) with the dimension table (Date):

SELECT * FROM Sales JOIN Date WHERE Date.day_of_week = 'Mon';

Without pushdown optimization, the execution proceeds as in the figure below.
The figure shows the computation without predicate pushdown: both the fact table Sales and the dimension table Date are scanned in full, the join is executed, and only then is the filter applied to the joined result, which is clearly wasteful.
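To make the example runnable, here is an assumed table setup (the schemas and the day_id join key are not specified in the article; they are placeholders used by the later snippets):

// The fact table is partitioned; the dimension table is small and unpartitioned.
// Backticks around Date avoid any clash with the DATE keyword.
spark.sql("""
  CREATE TABLE IF NOT EXISTS Sales (item STRING, amount DOUBLE, day_id INT)
  USING parquet
  PARTITIONED BY (day_id)
""")
spark.sql("""
  CREATE TABLE IF NOT EXISTS `Date` (day_id INT, day_of_week STRING)
  USING parquet
""")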
If the dimension table supports pushdown, its filter can be applied first, reducing the amount of Date data that is loaded; the fact table Sales and the filtered dimension table Date are then scanned and joined. But think about it: because the WHERE condition filters only the dimension table Date, Spark still has to read the entire fact table and join all of it with the filtered Date, which keeps the computation expensive. If we go one step further and use the Date filter to derive a filter set that is applied to the Sales scan before the join, the cost of the join drops dramatically, as shown in the next figure. This is what is called dynamic partition pruning.

A more detailed example: tables T1 and T2 are joined. To reduce the amount of data participating in the join, a filter data set is computed (the SQL on the right of the figure) and pushed into the scan of T1, so T1 is filtered before the join. Of course, this is a trade-off: the cost of the subquery that builds the filter set, and of saving it, must be weighed against the savings from filtering data out of the join. That trade-off is part of Spark SQL's optimization model.

How does Spark SQL optimize a query? In one picture: rule-based syntax optimization is done while the SQL is parsed, and the optimization of the logical plan is static; the choice of physical plan can then be made dynamically based on a statistics-driven cost model.

The next figure shows a join keyed on partition ID. The dimension table is not partitioned, while the fact table is. Without dynamic partition pruning, both tables are scanned in full, the filter is applied to the dimension table, and finally the join is executed. If instead the dimension table's filter is used to compute a filter set for the fact table, the amount of data the two tables bring into the join shrinks, just as in the T1/T2 example above.
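As a hedged sketch of what this looks like in practice (the flag below is the standard Spark 3.x setting; the query reuses the assumed tables above, and the exact plan text varies by version):

// DPP is on by default in Spark 3.0+, shown here only to make the knob explicit.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

// Conceptually, the optimizer turns the fact-table scan into something like:
//   SELECT * FROM Sales WHERE day_id IN (SELECT day_id FROM `Date` WHERE day_of_week = 'Mon')
spark.sql("""
  SELECT *
  FROM Sales s JOIN `Date` d ON s.day_id = d.day_id
  WHERE d.day_of_week = 'Mon'
""").explain()
// With pruning applied, the Sales scan's PartitionFilters should contain a
// dynamicpruningexpression(...) instead of reading every partition.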
Of course, this only pays off if the cost of computing and saving the fact table's filter set is much smaller than the gain from reducing the data in the join; otherwise the loss outweighs the gain. There is another join everyone is familiar with, the Broadcast Hash Join, where the broadcast result is reused to implement the pruning filter; this is built on BroadcastExchangeExec and will be covered in detail in a later article. For the accompanying code, follow the Langjian WeChat official account bigdatatip and reply with dpp to get the full PPT.
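As a rough illustration of the broadcast-reuse case mentioned above (a sketch under the assumption that the Date side is small enough to broadcast; node names differ slightly across Spark 3.x versions):

// Hint the small dimension table to be broadcast so the join becomes a Broadcast Hash Join.
spark.sql("""
  SELECT /*+ BROADCAST(d) */ *
  FROM Sales s JOIN `Date` d ON s.day_id = d.day_id
  WHERE d.day_of_week = 'Mon'
""").explain("formatted")
// When DPP piggybacks on the broadcast, the pruning subquery on the Sales scan shows up
// as a SubqueryBroadcast over a ReusedExchange of the join's BroadcastExchange,
// i.e. the filter set comes from the already-broadcast Date rows at no extra scan cost.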