Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What are the performance characteristics of Spark UDF

2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >

Share

Shulou(Shulou.com)06/02 Report--

This article focuses on "what are the performance characteristics of Spark UDF". Interested friends may wish to take a look. The method introduced in this paper is simple, fast and practical. Now let the editor take you to learn "what are the performance characteristics of Spark UDF"?

Spark provides a variety of solutions to complex challenges, but we are faced with many scenarios where native functions are not enough to solve the problem. Therefore, Spark allows us to register custom functions (User-Defined Functions, or UDFs)

In this article, we will explore the performance features of Spark's UDF.

Spark supports multiple languages, such as Python, Scala, Java, R, SQL. But usually data operations are written in PySpark or Spark Scala. We believe that Pyspark is adopted by most users for the following reasons:

Faster learning curve-Python is easier than Scala.

Broader community support-programmers' suggestions such as Pyspark performance are fed back to the community to form a better ecology.

Rich available class libraries-Python has many libraries for machine learning, timing analysis, and mathematical statistics.

A small performance difference-with the introduction of Spark DataFrames, it means that the performance of Scala and Python is almost the same. Datafarme is now organized by named columns (named columns) so that Spark can better understand Schema. The operations used to build the dataframe are compiled by Catalyst Optimizer into a physical execution plan (physical execution plan) to speed up the calculation.

The handover code is also easier for data engineers and data scientists. Some dataframe operations require UDFs, and PySpark may have performance problems. Some of the solutions are to use PySpark with Scala UDF and UDF Wrapper.

When the PySpark job is submitted, driver runs on the Python, and driver creates a SparkSession object and Dataframes/RDDs. These Python objects are wrapper objects, which are essentially JVM (Java) objects. To simplify, PySpark provides a wrapper to run native Scala code.

Spark UDF function

Registering custom functions through Scala, Python, or Java is a very general way to extend the capabilities of SQL users, who can call these functions without having to write code.

For example, multiply a collection of 100w rows by 1000:

Def times1000 (field): return field * 1000.00

Alternatively, reverse geocoding (reverse geocode) is performed on the longitude and latitude dataset:

Import geohashdef geohash_pyspark (lat, lon): return geohash.encode (lat, lon)

Spark SQL provides a way to register UDF by passing in a function in your own programming language. Scala and Python can use native functions or lamdba syntax, except for Java, which is more cumbersome and needs to be extended to this UDF class.

UDF can act on many different data types and return a different type. In Python and Java, we need to specify the send return type.

UDF can be registered in the following ways:

Spark.udf.register ("UDF_Name", function_name, returnType ())

* returnType () is mandatory in Python and Java.

Multiple Spark UDF and execution methods

In distributed mode, Spark is executed using the master/worker architecture. The driver communicates with a large number of workers (or executors). Driver and worker are running in their own Java process.

The driver side creates SparkContext, RDDs and performs some transformation operations through the main () method. Executors is responsible for running one task after another.

Performance benchmark test

We created a random latitude and longitude dataset containing 100w records and a total of 1.2GB to test the performance of the three Spark UDF types. We created two UDF: a simple function multiplied by 1000 and a complex geohash function. (so there are a total of 2 * 3 = 6 sets of tests)

Cluster configuration: 8 nodes

Driver node: 16-core 122GB memory

Worker node: 4-core 30.5GB memory, enabling automatic expansion

Notebook code: https://bit.ly/2YxiVp4 uses the QuantumBlack's method to run Scala UDF, PySpark UDF and PySpark Pandas UDF tests.

In addition to the above three types of UDF, we also create a Python wrapper to call Scala UDF in Pyspark. We found that in this way, we can both use simple python programming and take into account the performance of Scala UDF.

Create a Python wrapper with Pyspark code:

From pyspark.sql.column import Columnfrom pyspark.sql.column import _ to_java_columnfrom pyspark.sql.column import _ to_seqfrom pyspark.sql.functions import col

Def udfGeohashScalaWrapper (lat, lon): _ geohash = sc._jvm.sparkudfperformance.UDFs.udfGeohash () return Column (_ geohash.apply (_ to_seq (sc, [lat, lon], _ to_java_column)) def udfTimes1000ScalaWrapper (field): _ times1000 = sc._jvm.sparkudfperformance.UDFs.udfTimes1000 () return Column (_ times1000.apply (_ to_seq (sc, [field], _ to_java_column)

Databricks made a performance report on Pandas UDF https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html

Important conclusion

Here are the test results

The performance of Scala UDF is the best among the test results. As mentioned earlier, the conversion steps between Scala and Python make Python UDF need to deal with more things.

We also find that PySpark Pandas UDF performs better than PySpark UDF on small datasets or simple functions. If it is a complex function, such as introducing geohash, the performance of PySpark UDF will be 10 times better than that of PySpark Pandas UDF in this scenario.

We also found that in the PySpark code, creating a Python wrapper to call Scala UDF performs 15 times better than these two PySpark UDFs.

Taking into account some of the above performance features, QuantumBlack now adopts the following approach:

With PySpark UDF, if the dataset is small and you need to use simple functions for quick data insight.

Build a reusable built-in library of Scala UDF.

Create a Python wrapper to call Scala UDF

At this point, I believe you have a deeper understanding of "what are the performance characteristics of Spark UDF?" you might as well do it in practice. Here is the website, more related content can enter the relevant channels to inquire, follow us, continue to learn!

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Internet Technology

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report