What are the new features of Spark 3.0?

Today, I would like to talk to you about the new features of Spark 3.0. Many people may not know much about them, so the editor has summarized the following for you. I hope you can get something from this article.

Recently, the Apache Spark community released a preview of Spark 3.0, which contains many important new features that will help Spark make a powerful impact. In this era of big data and data science, the project already has a wide range of enterprise users and developers.

In the new version, the Spark community has migrated some functions from Spark SQL to the programmatic Scala API (org.apache.spark.sql.functions), which encourages developers to use them directly as part of their DataFrame transformations, rather than entering SQL mode or creating views and accessing these functions through SQL expressions or the callUDF function.

The community has also put considerable effort into introducing new data transformation and partition_transforms functions, which are useful when used with Spark's new DataFrameWriterV2 to write data to external storage.

Some of the new features in Spark 3 are already part of previous versions of Databricks Spark. Therefore, if you work in the Databricks cloud, you may find some familiar features.

The following describes the new Spark 3.0 functions available in Spark SQL and in the Scala API for DataFrame operations, as well as the functions that have been migrated from Spark SQL to the Scala API for programmatic access.

Functions introduced in Spark 3.0 for Spark SQL and for DataFrame transformations

From_csv

Like from_json, this function parses the column that contains the CSV string and converts it to type Struct. If the CSV string cannot be parsed, null is returned.

Example:

This function requires a Struct schema and options that indicate how to parse the CSV string. The options are the same as those of the CSV data source.

Ss= "dp-sql" > ss= "alt" > val studentInfo = ss= "string" > "1Nil ss=": ss= "string" > "2Nil ss=": ss= "string" > "3Nil ss=" > val ss= "keyword" > schema = new StructType () ss= "alt" > .ss = "keyword" > add (ss= "string" > "Id", IntegerType) ss= "> .ss =" keyword "> add (ss=" string ">" Name ") StringType) ss= "alt" > .ss = "keyword" > add (ss= "string" > "Dept", StringType) ss= "" > val options = Map (ss= "string" > "delimiter"-> ss= "string" > ",") ss= "alt" > val studentDF = studentInfo.toDF (ss= "string" > "Student_Info") ss= "> .withColumn (ss=" string ">" csv_struct ", from_csv ('Student_Info, ss= keyword" > schema,options) ss= "alt" > studentDF.show ()

To_csv

Converts a Struct type column to a CSV string.

Example:

Along with the Struct type column, this function also accepts an optional options parameter that indicates how to convert the Struct column to a CSV string.

Ss= "dp-sql" > ss= "alt" > studentDF ss= "> .withColumn (ss=" string ">" csv_string ", to_csv ($ss=" string ">" csv_struct ", Map.empty [String, String] .asJava) ss=" alt "> .show

Schema_of_csv

Infers the schema of a CSV string and returns the schema in DDL format.

Example:

This function requires a CSV string column and an optional argument that contains options for how to parse the CSV string.

Ss= "dp-sql" > ss= "alt" > studentDF ss= "" > .withColumn (ss= "string" > "schema", schema_of_csv (ss= "string" > "csv_string")) ss= "alt" > .show

Forall

Applies the given predicate to all elements in the array and returns true only if all elements in the array evaluate to true, otherwise it returns false.

Example:

Check that all elements in the given Array column are even.

Ss= "dp-sql" > ss= "alt" > val df = Seq (Seq (2pjung 4pc6), Seq (5pje 10p3) .toDF (ss= "string" > "int_array") ss= "> df.withColumn (ss=" string ">" flag ", forall ($ss=" string ">" int_array ", (XRV ssss =" keyword "> Column) = > (lit (x% 2pyr0) ss=" alt ">. Show

Transform

After applying the function to all elements in the array, a new array is returned.

Example:

Add "1" to all elements in the array.

Ss= "dp-sql" > ss= "alt" > val df = Seq ((Seq (2pje 4dagai 6)), (Seq (5je 10pai3)) .toDF (ss= "string" > "num_array") ss= "> df.withColumn (ss=" string ">" num_array ", transform ($ss=" string ">" num_array ", x = > xanth1). Show)

Overlay

Replaces the content of a column with the given replacement, starting from the specified byte position and, optionally, for the specified byte length.

Example:

Change the greeting of a specific person to the traditional "Hello World"

Here we replace the name with "World". Since the name starts at position 7, and we want to remove the full name before inserting the replacement, the byte length to be removed should be greater than or equal to the length of the longest name in the column.

Therefore, we pass the replacement word "World", the starting position "7", and the length "12" to remove from that starting position (the length is optional; if it is not specified, the function simply replaces the source content with the replacement content from the specified starting position onward).

Overlay can replace content in columns of StringType, TimestampType, IntegerType and so on. However, the returned Column is always of StringType, regardless of the input column type.

Ss= "dp-sql" > ss= "alt" > val greetingMsg = ss= "string" > "Hello Arun":: ss= "string" > "Hello Mohit Chawla":: ss= "string" > "Hello Shaurya":: Nil ss= "> val greetingDF = greetingMsg.toDF (ss=" string ">" greet_msg ") ss=" alt "> greetingDF.withColumn (ss=" string ">" greet_msg ", overlay ($ss=" string ">" greet_msg ", lit (ss=" string ">" World "), lit (ss=" ss= ">" 7 ") Lit (ss= "string" > "12")) ss= "> .show

Split

Splits the string expression based on the given regular expression and the specified limit, which indicates the number of times the regular expression is applied to the given string expression.

If the specified limit is less than or equal to zero, the regular expression is applied multiple times on the string, and the resulting array contains all possible string splits based on the given regular expression.

If the specified limit is greater than zero, the regular expression is applied at most limit - 1 times, and the resulting array contains no more than limit elements.

Example:

Split the given string expression into at most two parts, based on the "~" delimiter.

Ss= "dp-sql" > ss= "alt" > val num = ss= "string" > "one~two~three":: ss= "string" > "four~five":: Nil ss= "> val numDF = num.toDF (ss=" string ">" numbers ") ss=" alt "> numDF ss=" > .withColumn (ss= "string" > "numbers", split ($ss= "string" > "numbers", ss= "string" > "~", 2)) ss= "alt" > show.

Split the same string expression into as many parts as there are delimiter occurrences.

Ss= "dp-sql" > ss= "alt" > numDF ss= "> .withColumn (ss=" string ">" numbers ", split ($ss=" string ">" numbers ", ss=" string ">" ~ ", 0)) ss=" alt ">. Show

Map_entries

Converts the map's key-value pairs into an unordered array of entries.

Example:

Get all the keys and values of the Map column as an array of entries.

Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (1-> ss= "string" > "x", 2-> ss= "string" > "y") .toDF (ss= "string" > "key_values") ss= "> df.withColumn (ss=" string ">" key_value_array ", map_entries ($ss=" string ">" key_values ") ss=" alt "> show.

Map_zip_with

Merges two Maps into one according to the key, using the given function.

Example:

Calculate the total sales of employees across departments. By passing a function, map_zip_with sums the sales from two different Map columns based on the key, producing the total sales of each employee in a single map.

Ss= "dp-sql" > ss= "alt" > val df = Seq ((Map (ss= "string" > "EID_1"-> 10000 string ss= "string" > "EID_2"-> 25000), ss= "" > Map (ss= "string" > "EID_1"-> 1000pint SS = "string" > "EID_2"-> 2500)) .toDF (ss= "string" > "emp_sales_dept1", ss= "string" > "emp_sales_dept2") ss= "alt" > ss= "> df. Ss= "alt" > withColumn (ss= "string" > "total_emp_sales", map_zip_with ($ss= "string" > "emp_sales_dept1", $ss= "string" > "emp_sales_dept2", (kmeme v1 string v2) = > (v1+v2)) ss= "> .show

Map_filter

Returns a new map that contains only the key-value pairs whose values satisfy the given predicate function.

Example:

Filter only the employees whose sales value is more than 20000.

Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (ss= "string" > "EID_1"-> 10000 string ss= "string" > "EID_2"-> 25000)) ss= "> .toDF (ss=" string ">" emp_sales ") ss=" alt "> ss=" > df ss= "alt" > .withColumn (ss= "string" > "filtered_sales", map_filter ($ss= "string" > emp_sales ", (KMAV) = > (v > 20000)) ss=" > show.

Transform_values

Manipulates the values of all elements in the Map column according to the given function.

Example:

Calculate an employee's salary by giving each employee a 5000 raise

Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (ss= "string" > "EID_1"-> 10000 string ss= "string" > "EID_2"-> 25000)) ss= "> .toDF (ss=" string ">" emp_salary ") ss=" alt "> ss=" > df ss= "alt" > .withColumn (ss= "string" > "emp_salary", transform_values ($ss= "string" > emp_salary ", (KMJ v) = > (vault 5000)) ss=" > show.

Transform_keys

Manipulates the keys of all elements in the Map column according to the given function.

Example:

To add the company name "XYZ" to the employee number.

Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (ss= "string" > "EID_1"-> 10000, ss= "string" > "EID_2"-> 25000)) ss= "> .toDF (ss=" string ">" employees ") ss=" alt "> df ss=" > .withColumn (ss= "string" > "employees", transform_keys ($ss= "string" > "employees", (k, v) = > concat (k) Lit (ss= "string" > "_ XYZ") ss= "alt" > .show

Xxhash64

Calculates a hash code for the contents of the given columns using the 64-bit variant of the xxHash algorithm and returns the result as a long.
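
Example (a minimal sketch, reusing the studentDF and its Student_Info column from the from_csv example above; any combination of columns could be passed to xxhash64):

// compute a 64-bit xxHash over the raw student record string
studentDF
  .withColumn("record_hash", xxhash64($"Student_Info"))
  .show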

Functions migrated from Spark SQL to the Scala API for DataFrame transformations in Spark 3.0

Most Spark SQL functions are available in the Scala API and can be used as part of DataFrame operations. However, some functions could not be used as programmatic functions: to use them, you had to enter Spark SQL mode and use them as part of a SQL expression, or use Spark's callUDF feature to access the same functionality. With the growing popularity and use of these functions, some of them have been ported to the new version of the programmatic Spark API. The following functions have been migrated from the Spark SQL of previous versions to the Scala API (org.apache.spark.sql.functions).

Date_sub

Subtracts a number of days from date, timestamp, and string data types. If the data type is string, its format should be convertible to a date: "yyyy-MM-dd" or "yyyy-MM-dd HH:mm:ss.SSSS".

Example:

Subtract "1 day" from eventDateTime.

If the number of days to be subtracted is negative, this feature adds the given number of days to the actual date.

Ss= "dp-sql" > ss= "alt" > var df = Seq (ss= "> (1, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-01 23:00:01 ")), ss=" alt "> (2, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-02 12:40:32 "), ss=" > (3) Ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-03 09:54:00"), ss= "alt" > (4, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-04 10:12:43") ss= ">) ss=" alt "> .toDF (ss=" string ">" typeId ", ss=" string ">" eventDateTime ") ss=" > ss= "alt > df.withColumn (ss=" string ">" Adjusted_Date ") Date_sub ($ss= "string" > "eventDateTime", 1)) .show

Date_add

Same as date_sub, but the days are added to the actual date.

Example:

Add "1 day" to eventDateTime

Ss= "dp-sql" > ss= "alt" > var df = Seq (ss= "> (1, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-01 23:00:01 ")), ss=" alt "> (2, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-02 12:40:32 "), ss=" > (3) Ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-03 09:54:00"), ss= "alt" > (4, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-04 10:12:43") ss= ">) ss=" alt "> .toDF (ss=" string ">" Id ", ss=" string ">" eventDateTime ") ss=" > df ss= "alt > .withColumn (ss=" string ">" Adjusted Date ") Date_add ($ss= "string" > "eventDateTime", 1)) ss= "> .show ()

Add_months

Like date_add and date_sub, this feature helps to add months.

To subtract months, set the number of months to a negative value, because there is no separate function for subtracting months.

Example:

Add and subtract one month from eventDateTime.

Ss= "dp-sql" > ss= "alt" > var df = Seq (ss= "> (1, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-01 23:00:01 ")), ss=" alt "> (2, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-02 12:40:32 "), ss=" > (3, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-03 09:54:00")) Ss= "alt" > (4, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-04 10:12:43") ss= ">) .toDF (ss=" string ">" typeId ", ss=" string ">" eventDateTime ") ss=" alt "> / / ss=" keyword "> To ss=" keyword "> add one months ss=" > df ss= "alt" > .withColumn (ss= "string" > "Adjusted Date", add_months ($ss= "string" > "eventDateTime") 1) ss= "" > .show () ss= "alt" > / / ss= "keyword" > To subtract one months ss= "> df ss=" alt "> .withcolumn (ss=" string ">" Adjusted Date ", add_months ($ss=" string ">" eventDateTime ",-1)) ss=" > .show ()

Zip_with

Merge left and right arrays by applying functions.

This function expects both arrays to be the same length, and if one array is shorter than the other, null is added to match the longer array length.

Example:

Add and merge the contents of two arrays into one

Ss= "dp-sql" > ss= "alt" > val df = Seq ss= "> .toDF (ss=" string ">" array_1 ", ss=" string ">" array_2 ") ss=" alt "> ss=" > df ss= "alt > .withColumn (ss=" string ">" merged_array ", zip_with ($ss=" string ">" array_1 ", $ss=" string ">" array_2 ", x Y) = > (xroomy)) ss= "" > .show

Exists

Applies the given predicate to the elements of the array and returns true if at least one element satisfies the predicate.

Example:

Check whether at least one element in the array is even.

Ss= "dp-sql" > ss= "alt" > val df= Seq (Seq (2jung 4pens 6), Seq (5pje 10pas 3) .toDF (ss= "string" > "num_array") ss= "> df.withColumn (ss=" string ">" flag ", exists ($ss=" string ">" num_array ", x = > lit (x% 2pas)) ss=" alt ">. Show

Filter

Applies the given predicate to all elements in the array and keeps only the elements for which the predicate evaluates to true.

Example:

Keep only the even elements in the array.

Ss= "dp-sql" > ss= "alt" > val df = Seq (Seq (2pje 4pc6), Seq (5pje 10pas 3) .toDF (ss= "string" > "num_array") ss= "> df.withColumn (ss=" string ">" even_array ", filter ($ss=" string ">" num_array ", x = > lit (x% 2fetch 0)) ss=" alt "> show.

Aggregate

Reduces the given array and another value/state to a single value using the given function, and applies an optional finish function to convert the reduced value to another state/value.

Example:

Add a constant to the sum of the array elements and multiply the result by 2.

Ss= "dp-sql" > ss= "alt" > val df = Seq ((Seq (2pime 4pyr6), 3), (Seq (5page10pence3), 8)) ss= "> .toDF (ss=" string ">" num_array ", ss=" string ">" constant ") ss=" alt "> df.withColumn (ss=" string ">" reduced_array ", aggregate ($ss=" string ">" num_array ", $ss=" string ">" constant ", (xMagney) = > xonomy) X = > Spark SQL 2)) features introduced for Spark SQL mode in ss= "" > .showSpark 3.0

Here are the new SQL functions that you can only use in Spark SQL mode.

Acosh

Finds the inverse hyperbolic cosine of the given expression.

Asinh

Finds the inverse hyperbolic sine of the given expression.

Atanh

Finds the inverse hyperbolic tangent of the given expression.
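
Example (a minimal sketch, assuming an active SparkSession named spark, as in the count_if example below; the input values are illustrative):

spark.sql("SELECT acosh(2), asinh(2), atanh(0.5)").show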

Bit_and, bit_or and bit_xor

Calculate the bitwise AND, OR and XOR of the values of an expression.
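
Example (a minimal sketch, assuming an active SparkSession named spark; the numbers view and its values are illustrative):

var df = Seq((1), (2), (4)).toDF("num")
df.createOrReplaceTempView("numbers")
// bitwise AND, OR and XOR across all values of num
spark.sql("SELECT bit_and(num), bit_or(num), bit_xor(num) FROM numbers").show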

Bit_count

Returns the number of bits set in the given value.
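
Example (a minimal sketch, assuming an active SparkSession named spark; the literals are illustrative):

// 7 is 0b111 with three bits set, 8 is 0b1000 with one bit set
spark.sql("SELECT bit_count(7), bit_count(8)").show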

Bool_and and bool_or

bool_and verifies that all values of the expression are true, while bool_or verifies that at least one value of the expression is true.
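
Example (a minimal sketch, assuming an active SparkSession named spark; the inline VALUES table is illustrative):

spark.sql("SELECT bool_and(flag), bool_or(flag) FROM VALUES (true), (false), (true) AS t(flag)").show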

Count_if

Returns the number of values in a column for which the given boolean expression is true.

Example:

Count the even values in a given column.

Ss= "dp-sql" > ss= "alt" > var df = Seq ((1), (2), (4)) .toDF (ss= "string" > "num") ss= "> ss=" alt "> df.createOrReplaceTempView (ss=" string ">" table ") ss=" > spark.sql (ss= "string" > select count_if (num% 2 cycles 0) from table ") show.

Date_part

Extract part of the date / time stamp, such as hours, minutes, etc.
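
Example (a minimal sketch, assuming an active SparkSession named spark; the timestamp literal is illustrative):

// extract the hour and the minute from a timestamp
spark.sql("SELECT date_part('hour', timestamp '2020-01-01 12:30:45'), date_part('minute', timestamp '2020-01-01 12:30:45')").show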

Div

Used to divide one expression or column by another expression/column, returning the integral part of the division.
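
Example (a minimal sketch, assuming an active SparkSession named spark; div is used as an infix operator and the operands are illustrative):

// integral part of 9 divided by 4 is 2
spark.sql("SELECT 9 div 4").show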

Every and some

every returns true if the given expression evaluates to true for all values of the column, and some returns true if the expression evaluates to true for at least one of the values.
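
Example (a minimal sketch, assuming an active SparkSession named spark; the inline VALUES table is illustrative):

// every is false because one row is false, some is true because at least one row is true
spark.sql("SELECT every(flag), some(flag) FROM VALUES (true), (false), (true) AS t(flag)").show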

Make_date, make_interval and make_timestamp

Construct date, timestamp and specific interval.

Example:

Ss= "dp-sql" > ss= "alt" > ss= "keyword" > SELECT make_timestamp (2020, 01, 7, 30, 45.887)

Max_by and min_by

Compare two columns and return the value of the left column associated with the maximum / minimum value of the right column

Example:

Ss= "dp-sql" > ss= "alt" > var df = Seq ((1p1), (2p1), (4p3)) .toDF (ss= "string" > "x", ss= "string" > "y") ss= "> ss=" alt "> df.createOrReplaceTempView (ss=" string ">" table ") ss=" > spark.sql (ss= "string" > "select max_by (xQuery) from table") show.

Typeof

Returns the data type of the column value
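
Example (a minimal sketch, assuming an active SparkSession named spark; the literals are illustrative):

spark.sql("SELECT typeof(1), typeof('abc'), typeof(array(1, 2))").show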

Version

Returns the Spark version and its git version
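
Example (a minimal sketch, assuming an active SparkSession named spark):

// returns the Spark version together with its git revision
spark.sql("SELECT version()").show(false)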

Justify_days, justify_hours and justify_interval

These newly introduced justify functions are used to adjust time intervals (for example, treating 30 days as a month).

Example:

Treat 30 days as one month.

Ss= "dp-sql" > ss= "alt" > ss= "keyword" > SELECT justify_days (interval ss= "string" >'30 day')

Partition transform functions

Starting with Spark 3.0, there are new functions that help partition data, which I'll cover in another article.

Overall, we have gone through the data transformation and analysis functions introduced in Spark 3.0. I hope this guide helps you understand these new features. They will certainly accelerate Spark development and help build solid and efficient Spark pipelines.

After reading the above, do you have a better understanding of the new features of Spark 3.0? If you want to learn more, please follow the industry information channel. Thank you for your support.
