An analysis of the Spark 3.0 technical preview on CDP 7.1.1

2025-07-02 Update. From: SLTechnology News&Howtos

Shulou (Shulou.com), 06/01 report:

This article analyzes the Spark 3.0 technical preview available on CDP 7.1.1: what is new in Spark 3, and which components the preview does not yet support.

Here are the key new features of Spark 3:

1. Further improvement in Spark 3 TPC-DS performance

2. Language support

A) The Scala version is upgraded to 2.12.

B) JDK 11 is fully supported.

C) Python 3.6+ is supported. Python 2 and Python 3 versions earlier than 3.6 are deprecated.

3. Adaptive Query Execution in Spark SQL

A) For Adaptive Query Execution (AQE), the most important question is when to re-optimize the execution plan. The operators of a Spark job are arranged in pipelines and executed in parallel. A shuffle or broadcast exchange breaks the pipeline; these break points are called materialization points, and the fragments of the plan that they delimit are called query stages. Each query stage materializes its intermediate results, and a downstream stage can start only after that stage and all stages running in parallel with it have finished. At that moment the partition statistics of the upstream stages are available and the downstream stages have not yet started, which gives AQE an opportunity to re-optimize.

When a query starts and the initial execution plan has been generated, the AQE framework first finds and runs the stages that have no upstream dependencies. As soon as one or more of these stages finish, the framework marks them as complete in the physical plan and updates the logical plan with the runtime statistics they produced. Using these statistics, the framework runs the optimizer with a set of logical optimization rules, then the physical planner and the physical optimization rules, which include both the regular rules and rules specific to adaptive execution, such as partition merging and data skew handling. The result is a newly optimized plan in which some stages have already been executed. The framework repeats this cycle until the entire query is finished.
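One of the adaptive rules mentioned above, partition merging, can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Spark's implementation: the function name `coalesce_partitions`, the sample statistics, and the target size are all hypothetical. In Spark 3.0 itself, AQE is switched on with the `spark.sql.adaptive.enabled` setting.

```python
from typing import List


def coalesce_partitions(sizes: List[int], target: int) -> List[List[int]]:
    """Greedily group adjacent shuffle partitions so each group stays within `target`.

    `sizes` stands in for the per-partition statistics that become available
    once an upstream query stage has finished. A single partition larger than
    `target` is kept as its own group.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group when adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical sizes reported by a finished stage (arbitrary units).
stage_stats = [10, 20, 5, 70, 30, 30, 5]
merged = coalesce_partitions(stage_stats, target=64)
print(merged)  # [[0, 1, 2], [3], [4, 5], [6]]
```

Seven small partitions become four downstream tasks here. The point of the sketch is that the grouping can only be computed after the upstream stage has reported its sizes, which is why the decision is made at a materialization point.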

4. Dynamic Partition Pruning (DPP)

A) Spark 3.0 introduces dynamic partition pruning, a major performance improvement for SQL analytics workloads. The idea behind DPP is to take the filter applied to the dimension table and apply its result directly to the fact table, so that partitions which cannot match are never scanned. DPP is implemented in both logical plan optimization and physical planning. It speeds up many TPC-DS queries considerably and works well with star schemas, without the need to denormalize the tables.
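The idea can be illustrated with a small plain-Python simulation. It is only a sketch of the principle, not how Spark executes it: the tables `dim_date` and `fact_sales`, their columns, and the function `pruned_join` are hypothetical. In Spark 3.0 the behavior is governed by `spark.sql.optimizer.dynamicPartitionPruning.enabled`.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical star schema: a small dimension table, and a fact table
# stored as one list of rows per partition, partitioned by date_id.
dim_date: List[Dict[str, int]] = [
    {"date_id": 1, "year": 2019},
    {"date_id": 2, "year": 2020},
    {"date_id": 3, "year": 2020},
]
fact_sales: Dict[int, List[Tuple[int, float]]] = {
    1: [(1, 9.5), (1, 3.0)],
    2: [(2, 4.0)],
    3: [(3, 7.5), (3, 1.0)],
}


def pruned_join(year: int) -> Tuple[List[int], List[Tuple[int, float]]]:
    # Step 1: evaluate the filter on the dimension table first.
    keys: Set[int] = {row["date_id"] for row in dim_date if row["year"] == year}
    # Step 2: reuse the surviving keys as a partition filter on the fact table,
    # so partitions that cannot match are never read.
    scanned = sorted(k for k in fact_sales if k in keys)
    rows = [row for k in scanned for row in fact_sales[k]]
    return scanned, rows


scanned_partitions, joined_rows = pruned_join(2020)
print(scanned_partitions)  # [2, 3]
print(joined_rows)         # [(2, 4.0), (3, 7.5), (3, 1.0)]
```

Partition 1 is skipped entirely, even though the query text contains no filter on the fact table itself.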

5. Binary file data source

A) Spark 3.0 supports a binary file data source. It reads binary files and converts each file into a single row that contains the raw content and the metadata of the file.
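In PySpark this reader is selected with `spark.read.format("binaryFile")`, and each resulting row has the columns `path`, `modificationTime`, `length` and `content`. The plain-Python sketch below only imitates that row shape so it can run without a cluster; the helper `read_binary_files` is hypothetical and is not part of Spark.

```python
import os
import tempfile
from datetime import datetime
from typing import Dict, List


def read_binary_files(directory: str) -> List[Dict[str, object]]:
    """Return one record per file, mirroring the four binaryFile columns."""
    records: List[Dict[str, object]] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories
        with open(path, "rb") as handle:
            content = handle.read()
        records.append({
            "path": path,
            "modificationTime": datetime.fromtimestamp(os.path.getmtime(path)),
            "length": len(content),
            "content": content,
        })
    return records


# Write a throwaway file, then read it back as a single "row".
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "a.bin"), "wb") as f:
        f.write(b"\x00\x01\x02")
    rows = read_binary_files(tmp)

print(rows[0]["length"], rows[0]["content"])  # 3 b'\x00\x01\x02'
```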

6. DataSource V2 improvements

A) Pluggable catalog integration

B) Improved predicate pushdown, which speeds up queries by reducing the amount of data loaded

7. YARN features

A) Spark 3.0 can automatically discover GPUs on a YARN cluster and schedule tasks onto the specified GPU nodes.
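As a rough sketch, GPU scheduling is requested through Spark's resource settings. The three configuration keys below exist in Spark 3.0; the values, the discovery script path, and the helper `to_submit_args` are assumptions made for illustration, and the YARN cluster must also be configured to expose GPUs as a resource.

```python
from typing import Dict, List

gpu_conf: Dict[str, str] = {
    # Number of GPUs requested per executor.
    "spark.executor.resource.gpu.amount": "1",
    # Number of GPUs each task needs.
    "spark.task.resource.gpu.amount": "1",
    # Hypothetical path: a script that reports the GPU addresses on the node.
    "spark.executor.resource.gpu.discoveryScript": "/opt/spark/getGpusResources.sh",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


submit_args = ["spark-submit", "--master", "yarn"] + to_submit_args(gpu_conf)
print(" ".join(submit_args))
```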

8. Kafka connector delegation token (0.10+)

A) An application only needs to set Spark configuration parameters to authenticate, instead of logging in through a JAAS configuration.

The following components are not supported in this preview version:

Hive Warehouse Connector

Kudu

HBase Connector

Oozie

Livy

Zeppelin

That concludes the analysis of the Spark 3.0 technical preview on CDP 7.1.1. Several of these features are likely to come up in day-to-day work.
