
What are the major features of Apache Spark 3.0

2025-02-27 Update From: SLTechnology News&Howtos

Shulou(Shulou.com)06/01 Report--

This article explains in detail the major features of Apache Spark 3.0. The content is of high quality, so the editor shares it with you for reference. I hope you will have a good understanding of the relevant topics after reading it.

Apache Spark 3.0 adds many exciting new features, including dynamic partition pruning (Dynamic Partition Pruning), adaptive query execution (Adaptive Query Execution), accelerator-aware scheduling (Accelerator-aware Scheduling), a data source API with catalog support (Data Source API with Catalog Support), vectorization in SparkR (Vectorization in SparkR), support for Hadoop 3/JDK 11/Scala 2.12, and more.

A complete list of major features and changes in Spark 3.0.0-preview can be found here.

Spark 3.0 does not seem to have many Streaming/Structured Streaming-related ISSUEs, and there may be several reasons for this:

At present, the batch-based Spark Streaming/Structured Streaming can meet most enterprise needs, and few applications really require very low-latency continuous computation, so the Continuous Processing module remains in the experimental stage and there is no rush to graduate it.

Databricks has invested a large number of people in developing Delta Lake-related work, which can bring the company revenue and is its current focus, so naturally less effort goes into the development of Streaming.

Dynamic partition pruning (Dynamic Partition Pruning)

So-called dynamic partition pruning is further partition pruning based on information inferred at runtime (run time). For example, we have the following query:

SELECT * FROM dim_iteblog
JOIN fact_iteblog
ON (dim_iteblog.partcol = fact_iteblog.partcol)
WHERE dim_iteblog.othercol > 10

Suppose the filter dim_iteblog.othercol > 10 on the dim_iteblog table keeps only a small amount of data, but because earlier versions of Spark cannot evaluate this at runtime, the fact_iteblog table may end up scanning a large amount of useless data. With dynamic partition pruning, the useless data in the fact_iteblog table can be filtered out at runtime. After this optimization, the data scanned by the query is greatly reduced, and performance improves by up to 33 times.
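
In Spark 3.0 this behaviour is controlled by a single configuration parameter. The following is only a rough sketch (the configuration key is the Spark 3.0 switch for this feature; the dim_iteblog and fact_iteblog tables from the query above are assumed to already exist, with fact_iteblog partitioned by partcol):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dpp-demo")
  // Dynamic partition pruning is enabled by default in Spark 3.0; set explicitly here for clarity.
  .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
  .getOrCreate()

// The physical plan should contain a dynamic pruning subquery on fact_iteblog's
// partition column, built at runtime from the filtered dim_iteblog rows.
spark.sql(
  """SELECT *
    |FROM dim_iteblog JOIN fact_iteblog
    |  ON dim_iteblog.partcol = fact_iteblog.partcol
    |WHERE dim_iteblog.othercol > 10""".stripMargin).explain()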

The ISSUE for this feature can be found in SPARK-11150 and SPARK-28888.

Adaptive query execution (Adaptive Query Execution)

Adaptive query execution (also known as Adaptive Query Optimisation or Adaptive Optimisation) is the optimization of the query execution plan at runtime: it allows the Spark planner to choose among alternative execution plans at run time, based on runtime statistics.

As early as 2015, the Spark community put forward the basic idea of adaptive execution: an interface for submitting a single map stage was added to Spark's DAGScheduler, and an attempt was made to adjust the number of shuffle partitions at runtime. However, that implementation has some limitations. In some scenarios it introduces extra shuffles, that is, more stages, and it cannot handle cases where three tables are joined in the same stage well; it is also difficult to implement other adaptive features flexibly within that framework, such as changing the execution plan or handling skewed joins at runtime. As a result, this feature has remained experimental, and its configuration parameters are not mentioned in the official documentation. The idea mainly comes from experts at Intel and Baidu; see SPARK-9850 for details.

The adaptive query execution in Apache Spark 3.0 builds on the idea of SPARK-9850; see SPARK-23128 for details. The goal of SPARK-23128 is to implement a flexible framework for adaptive execution in Spark SQL and to support changing the number of reducers at runtime. The new implementation addresses all the limitations discussed above, and other features, such as changing the join strategy and handling skewed joins, will be implemented as separate functions and provided as plug-ins in later versions.
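
In Spark 3.0 the new framework is switched on through a few configuration parameters. The following is only a minimal sketch (the configuration keys are the documented AQE switches; the query and the fact_iteblog table are just placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-demo")
  .config("spark.sql.adaptive.enabled", "true")                    // enable adaptive query execution
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true") // coalesce small shuffle partitions at runtime
  .config("spark.sql.adaptive.skewJoin.enabled", "true")           // split skewed join partitions at runtime
  .getOrCreate()

// With AQE enabled, the number of reducers and the join strategy of this query
// can be revised at runtime based on the statistics of completed shuffle stages.
spark.sql("SELECT partcol, count(*) FROM fact_iteblog GROUP BY partcol").show()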

Accelerator aware scheduling (Accelerator-aware Scheduling)

Nowadays, big data is increasingly combined with machine learning. In machine learning, because compute iterations can take a long time, developers generally choose GPU, FPGA or TPU to accelerate computation. Native support for GPU and FPGA has already been built into Apache Hadoop 3.1, and Spark, as a general-purpose computing engine, will certainly not be left behind. Engineers from Databricks, NVIDIA, Google and Alibaba are adding native GPU scheduling support to Apache Spark. This work fills the gap in Spark's scheduling of GPU resources, organically integrates big data processing and AI applications, and expands Spark's application scenarios in deep learning, signal processing and other big data applications. The ISSUE for this work can be viewed in SPARK-24615, and the related SPIP (Spark Project Improvement Proposals) documentation can be found in SPIP: Accelerator-aware scheduling.

Currently, the resource managers supported by Apache Spark, YARN and Kubernetes, already support GPU. For Spark itself to support GPUs, two major changes need to be made at the technical level:

At the cluster manager level, the cluster managers need to be upgraded to support GPU and to provide users with the relevant APIs, so that users can control the use and allocation of GPU resources.

Within Spark, modifications need to be made at the scheduler level so that the scheduler can identify the GPU requirements in the user's task requests and then complete the allocation according to the GPU supply on the executors.

Because having Apache Spark support GPUs is a big feature, the project is divided into several phases. In Apache Spark 3.0, GPU support will be available under the standalone, YARN and Kubernetes resource managers, with little impact on existing normal jobs. Support for TPU, GPU support in the Mesos resource manager, and GPU support on the Windows platform are not goals of this release. Fine-grained scheduling within a single GPU card is also not supported in this version; Apache Spark 3.0 treats a GPU card and its memory as an inseparable unit.
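
From the user's point of view, requesting and using GPUs in Spark 3.0 looks roughly like the following sketch (the configuration keys are the new resource options; the discovery script path is a placeholder, and the job must run on a cluster whose resource manager actually exposes GPUs):

import org.apache.spark.TaskContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("gpu-demo")
  .config("spark.executor.resource.gpu.amount", "1") // GPUs requested per executor
  .config("spark.task.resource.gpu.amount", "1")     // GPUs required per task
  .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh") // placeholder script
  .getOrCreate()

spark.sparkContext.parallelize(1 to 4, 4).foreach { _ =>
  // Each task can look up which GPU addresses the scheduler assigned to it.
  val gpus = TaskContext.get().resources()("gpu").addresses
  println(s"task running on GPU(s): ${gpus.mkString(",")}")
}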

Apache Spark DataSource V2

The Data Source API defines the interfaces for reading from and writing to storage systems, such as Hadoop's InputFormat/OutputFormat and Hive's SerDe. These APIs are well suited to users programming against RDDs in Spark. Programming with them can solve our problems, but the cost to users is high, and Spark cannot optimize such code. To solve these problems, Spark 1.3 introduced Data Source API V1, through which we can easily read data from various sources, and Spark uses the optimizer of the SQL component to optimize data source reads, such as column pruning and filter push-down.

Data Source API V1 abstracts a series of interfaces for us that can cover most scenarios. However, as the number of users grows, some problems gradually emerge:

Some interfaces rely on SQLContext and DataFrame

Limited extensibility, making it difficult to push down other operators

Lack of support for columnar reads

Lack of partition and sorting information

Write operations do not support transactions

Stream processing is not supported

In order to solve some of the problems of Data Source API V1, starting from Apache Spark 2.3.0 the community introduced Data Source API V2, which retains the original functionality while fixing some of V1's problems, such as no longer depending on upper-level APIs and offering better extensibility. The ISSUE for Data Source API V2 can be found in SPARK-15689. Although the feature appeared in Apache Spark 2.x, it was not very stable, so the community opened two ISSUEs on the stability and new features of Spark DataSource API V2: SPARK-25186 and SPARK-22386. The final stable version of Spark DataSource API V2, together with the new features, will be released with Apache Spark 3.0.0 at the end of the year, and this is also a major new feature of Apache Spark 3.0.0.
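
To give a feel for the shape of the new API, the following is a heavily simplified, read-only sketch against the Spark 3.0 connector interfaces (org.apache.spark.sql.connector.*). The Spark interfaces are real, but the SimpleSource/SimpleTable/SimpleScan class names and the hard-coded data are made up, and a real source would also implement write paths, push-downs and streaming:

import java.util

import scala.collection.JavaConverters._

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.{SupportsRead, Table, TableCapability, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.connector.read._
import org.apache.spark.sql.types.{IntegerType, StructType}
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Entry point that spark.read.format(...) looks up.
class SimpleSource extends TableProvider {
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("value", IntegerType)

  override def getTable(schema: StructType, partitioning: Array[Transform],
                        properties: util.Map[String, String]): Table = new SimpleTable(schema)
}

// A batch-readable table with a fixed schema.
class SimpleTable(tableSchema: StructType) extends Table with SupportsRead {
  override def name(): String = "simple_table"
  override def schema(): StructType = tableSchema
  override def capabilities(): util.Set[TableCapability] = Set(TableCapability.BATCH_READ).asJava
  override def newScanBuilder(options: CaseInsensitiveStringMap): ScanBuilder = new ScanBuilder {
    override def build(): Scan = new SimpleScan(tableSchema)
  }
}

// One scan = one batch with a single partition that emits the numbers 0..9.
class SimpleScan(tableSchema: StructType) extends Scan with Batch {
  override def readSchema(): StructType = tableSchema
  override def toBatch: Batch = this
  override def planInputPartitions(): Array[InputPartition] = Array(new InputPartition {})
  override def createReaderFactory(): PartitionReaderFactory = new PartitionReaderFactory {
    override def createReader(partition: InputPartition): PartitionReader[InternalRow] =
      new PartitionReader[InternalRow] {
        private var i = -1
        override def next(): Boolean = { i += 1; i < 10 }
        override def get(): InternalRow = InternalRow(i)
        override def close(): Unit = ()
      }
  }
}

Assuming such a class is on the classpath, it could then be loaded with spark.read.format("com.example.SimpleSource").load() (the package name here is hypothetical).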

Better ANSI SQL compatibility

PostgreSQL is one of the most advanced open source databases. It supports most of the major features of SQL:2011 and fully meets at least 160 of the 179 features required by SQL:2011. The Spark community has opened a dedicated ISSUE, SPARK-27764, to address the differences between Spark SQL and PostgreSQL, covering functional feature completion, bug fixes and so on. The functional work includes supporting some ANSI SQL functions and distinguishing SQL reserved keywords from built-in functions. There are 231 sub-ISSUEs under this ISSUE. If all of them are resolved, the difference between Spark SQL and PostgreSQL, or ANSI SQL:2011, will become even smaller.
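
As one small, concrete example of what this work means in practice (a sketch that assumes a running spark-shell session, i.e. an existing SparkSession named spark): Spark 3.0 introduces the spark.sql.ansi.enabled option, under which, for instance, integer overflow raises an error at runtime instead of silently wrapping around as in the legacy behaviour:

spark.conf.set("spark.sql.ansi.enabled", "true")
// 2147483647 is the maximum INT value, so this addition overflows and,
// with ANSI mode on, the query fails instead of returning a wrapped-around value.
spark.sql("SELECT 2147483647 + 1").show()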

SparkR vectorized read-write

Spark has supported the R language since version 1.4, but the architecture for the interaction between Spark and R at that time was as follows:

Whenever we use the R language to interact with the Spark cluster, the data has to go through the JVM, which cannot avoid serialization and deserialization; with large amounts of data, this performs very poorly!

Moreover, Apache Spark has already applied vectorization optimization (vectorization optimization) to many operations, such as the internal columnar format (columnar format), vectorized Parquet/ORC reading, Pandas UDFs, and so on. Vectorization can greatly improve performance. SparkR vectorization allows users to keep their existing code as is while improving performance by roughly several thousand times when executing R native functions or converting between a Spark DataFrame and an R DataFrame. You can take a look at SPARK-26759 for this work.

As can be seen, SparkR vectorization uses Apache Arrow, which makes data exchange between the systems very efficient and avoids the cost of data serialization and deserialization, so after adopting it, the performance of SparkR and Spark interaction is greatly improved.

Other

Spark on K8S: Spark's support for Kubernetes started with version 2.3 and was improved in Spark 2.4; Spark 3.0 will add support for Kerberos and dynamic resource allocation.

Remote Shuffle Service: the current shuffle has many problems, such as poor elasticity, a heavy impact on NodeManager, and a poor fit for cloud environments. To solve these problems, a Remote Shuffle Service will be introduced. For more information, see SPARK-25299.

Support for JDK 11: see SPARK-24417. JDK 11 is chosen directly because JDK 8 is approaching EOL (end of life), while JDK 9 and JDK 10 are already EOL, so the community skips JDK 9 and JDK 10 and supports JDK 11 directly. However, the Spark 3.0 preview still uses JDK 1.8 by default.

Support for Scala 2.11 is removed; Scala 2.12 is supported by default. For more information, please see SPARK-26132.

Hadoop 3.2 is supported; for more information, see SPARK-23710. It has been about two years since Hadoop 3.0 was released (Apache Hadoop 3.0.0-beta1 was officially released, with the GA version to follow), so it is natural for Spark to support Hadoop 3. However, Hadoop 2.7.4 is still used by default in the preview version of Spark 3.0.

That is all about the major features of Apache Spark 3.0. I hope the above content is of some help to you and lets you learn more. If you think the article is good, you can share it for more people to see.
