In addition to Weibo, there is also WeChat
Please pay attention
WeChat public account
Shulou
2025-01-16 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Internet Technology >
Share
Shulou(Shulou.com)06/01 Report--
Today, I would like to talk to you about the new features of Spark 3.0. maybe many people don't know much about it. In order to make you understand better, the editor summed up the following for you. I hope you can get something from this article.
Recently, the Apache Spark community released a preview of Spark 3.0, which contains many important new features that will help Spark create a powerful impact, in this era of big data and data science, the product already has a wide range of enterprise users and developers.
In the new version, the Spark community has migrated some features from Spark SQL to programmatic Scala API (
Org.apache.spark.sql.functions), which encourages developers to use this functionality directly as part of their DataFrame transformation, rather than typing directly into SQL mode or creating views, and using this function as well as SQL expressions or callUDF functions.
The community has also laboriously introduced new data conversion capabilities and partition_transforms functions that are useful when used with Spark's new DataFrameWriterv2 to write data to some external storage.
Some of the new features in Spark 3 are already part of previous versions of Databricks Spark. Therefore, if you work in the Databricks cloud, you may find some familiar features.
The following describes the new Spark features in Spark SQL and Scala API for DataFrame operation access, as well as the ability to migrate from Spark SQL to Scala API for programmatic access.
Features introduced in Spark 3.0 in Spark SQL and features for DataFrame transformation
From_csv
Like from_json, this function parses the column that contains the CSV string and converts it to type Struct. If the CSV string cannot be parsed, null is returned.
Example:
This function requires a Struct pattern and options that indicate how to parse CSV strings. The option is the same as the CSV data source.
Ss= "dp-sql" > ss= "alt" > val studentInfo = ss= "string" > "1Nil ss=": ss= "string" > "2Nil ss=": ss= "string" > "3Nil ss=" > val ss= "keyword" > schema = new StructType () ss= "alt" > .ss = "keyword" > add (ss= "string" > "Id", IntegerType) ss= "> .ss =" keyword "> add (ss=" string ">" Name ") StringType) ss= "alt" > .ss = "keyword" > add (ss= "string" > "Dept", StringType) ss= "" > val options = Map (ss= "string" > "delimiter"-> ss= "string" > ",") ss= "alt" > val studentDF = studentInfo.toDF (ss= "string" > "Student_Info") ss= "> .withColumn (ss=" string ">" csv_struct ", from_csv ('Student_Info, ss= keyword" > schema,options) ss= "alt" > studentDF.show ()
To_csv
To convert the structure Type column to an CSV string.
Example:
Along with the Struct type column, this function also accepts an optional options parameter that indicates how to convert the Struct column to a CSV string.
Ss= "dp-sql" > ss= "alt" > studentDF ss= "> .withColumn (ss=" string ">" csv_string ", to_csv ($ss=" string ">" csv_struct ", Map.empty [String, String] .asJava) ss=" alt "> .show
Infers the pattern of the CSV string and returns the pattern in DDL format.
Example:
This function requires a CSV string column and an optional argument that contains options for how to parse the CSV string.
Ss= "dp-sql" > ss= "alt" > studentDF ss= "" > .withColumn (ss= "string" > "schema", schema_of_csv (ss= "string" > "csv_string")) ss= "alt" > .show
For_all
Applies the given predicate to all elements in the array and returns true only if all elements in the array evaluate to true, otherwise it returns false.
Example:
Check that all elements in the given Array column are even.
Ss= "dp-sql" > ss= "alt" > val df = Seq (Seq (2pjung 4pc6), Seq (5pje 10p3) .toDF (ss= "string" > "int_array") ss= "> df.withColumn (ss=" string ">" flag ", forall ($ss=" string ">" int_array ", (XRV ssss =" keyword "> Column) = > (lit (x% 2pyr0) ss=" alt ">. Show
Transform
After applying the function to all elements in the array, a new array is returned.
Example:
Add "1" to all elements in the array.
Ss= "dp-sql" > ss= "alt" > val df = Seq ((Seq (2pje 4dagai 6)), (Seq (5je 10pai3)) .toDF (ss= "string" > "num_array") ss= "> df.withColumn (ss=" string ">" num_array ", transform ($ss=" string ">" num_array ", x = > xanth1). Show)
Overlay
To replace the contents of a column, use the actual replacement from the specified byte position to the optional specified byte length.
Example:
Change the greeting of a specific person to the traditional "Hello World"
Here we replace the name with the world, because the starting position of the name is 7, and we want to delete the full name before replacing the content, the length of the byte position to be deleted should be greater than or equal to the length of the name in the maximum column.
Therefore, we pass the replacement word "world", replace the content with the specific starting position of "7", and remove "12" from the specified starting position (if not specified, the position is optional, the function will only replace the source content with the replacement content from the specified starting position).
The override replaces the content in StringType,TimeStampType,IntegerType, etc. However, the return type of Column will always be StringType, regardless of the Column input type.
Ss= "dp-sql" > ss= "alt" > val greetingMsg = ss= "string" > "Hello Arun":: ss= "string" > "Hello Mohit Chawla":: ss= "string" > "Hello Shaurya":: Nil ss= "> val greetingDF = greetingMsg.toDF (ss=" string ">" greet_msg ") ss=" alt "> greetingDF.withColumn (ss=" string ">" greet_msg ", overlay ($ss=" string ">" greet_msg ", lit (ss=" string ">" World "), lit (ss=" ss= ">" 7 ") Lit (ss= "string" > "12")) ss= "> .show
Split
Splits the string expression based on the given regular expression and the specified limit, which indicates the number of times the regular expression is applied to the given string expression.
If the specified limit is less than or equal to zero, the regular expression is applied multiple times on the string, and the resulting array contains all possible string splits based on the given regular expression.
If the specified limit is greater than zero, a regular expression that does not exceed that limit is used
Example:
Splits a given string expression into two based on a regular expression. The string delimiter.
Ss= "dp-sql" > ss= "alt" > val num = ss= "string" > "one~two~three":: ss= "string" > "four~five":: Nil ss= "> val numDF = num.toDF (ss=" string ">" numbers ") ss=" alt "> numDF ss=" > .withColumn (ss= "string" > "numbers", split ($ss= "string" > "numbers", ss= "string" > "~", 2)) ss= "alt" > show.
Divide the same string expression into multiple parts until a delimiter appears
Ss= "dp-sql" > ss= "alt" > numDF ss= "> .withColumn (ss=" string ">" numbers ", split ($ss=" string ">" numbers ", ss=" string ">" ~ ", 0)) ss=" alt ">. Show
Map_entries
Converts the mapped key value to an unordered array of entries.
Example:
Gets all the keys and values of Map in the array.
Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (1-> ss= "string" > "x", 2-> ss= "string" > "y") .toDF (ss= "string" > "key_values") ss= "> df.withColumn (ss=" string ">" key_value_array ", map_entries ($ss=" string ">" key_values ") ss=" alt "> show.
Map_zip_with
Use the function to merge two Map into one according to the key.
Example:
To calculate the total sales of cross-departmental employees, and by passing a function, this function summarizes the total sales from two different Map columns based on the key, thus obtaining the total sales of a particular employee in a single map.
Ss= "dp-sql" > ss= "alt" > val df = Seq ((Map (ss= "string" > "EID_1"-> 10000 string ss= "string" > "EID_2"-> 25000), ss= "" > Map (ss= "string" > "EID_1"-> 1000pint SS = "string" > "EID_2"-> 2500)) .toDF (ss= "string" > "emp_sales_dept1", ss= "string" > "emp_sales_dept2") ss= "alt" > ss= "> df. Ss= "alt" > withColumn (ss= "string" > "total_emp_sales", map_zip_with ($ss= "string" > "emp_sales_dept1", $ss= "string" > "emp_sales_dept2", (kmeme v1 string v2) = > (v1+v2)) ss= "> .show
Map_filter
Returns a new key-value pair that contains only Map values that satisfy the given predicate function.
Example:
Only employees with a sales value of more than 20000 are screened.
Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (ss= "string" > "EID_1"-> 10000 string ss= "string" > "EID_2"-> 25000)) ss= "> .toDF (ss=" string ">" emp_sales ") ss=" alt "> ss=" > df ss= "alt" > .withColumn (ss= "string" > "filtered_sales", map_filter ($ss= "string" > emp_sales ", (KMAV) = > (v > 20000)) ss=" > show.
Transform_values
Manipulates the values of all elements in the Map column according to the given function.
Example:
Calculate an employee's salary by giving each employee a 5000 raise
Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (ss= "string" > "EID_1"-> 10000 string ss= "string" > "EID_2"-> 25000)) ss= "> .toDF (ss=" string ">" emp_salary ") ss=" alt "> ss=" > df ss= "alt" > .withColumn (ss= "string" > "emp_salary", transform_values ($ss= "string" > emp_salary ", (KMJ v) = > (vault 5000)) ss=" > show.
Transform_keys
Manipulates the keys of all elements in the Map column according to the given function.
Example:
To add the company name "XYZ" to the employee number.
Ss= "dp-sql" > ss= "alt" > val df = Seq (Map (ss= "string" > "EID_1"-> 10000, ss= "string" > "EID_2"-> 25000)) ss= "> .toDF (ss=" string ">" employees ") ss=" alt "> df ss=" > .withColumn (ss= "string" > "employees", transform_keys ($ss= "string" > "employees", (k, v) = > concat (k) Lit (ss= "string" > "_ XYZ") ss= "alt" > .show
Xhash74
To calculate the hash code for the contents of a given column, use the 64-bit xxhash algorithm and return the result as long.
The function of DataFrame conversion from Spark SQL to Scala API in Spark 3.0
Scala API can use most Spark SQL functions, which can use the same function as part of a DataFrame operation. However, there are still some features that cannot be used as programming functions. To use these features, you must enter Spark SQL mode and use them as part of an SQL expression, or use the Spark "callUDF" feature to use the same functionality. With the popularity and use of functions, some of these functions have been transplanted to new versions of programmed Spark API in the past. The following is migrated from the previous version of the Spark SQL function to Scala API
Function of org.spark.apache.sql.functions)
Date_sub
Subtract the number of days from the date, time stamp, and string data types. If the data type is a string, its format should be convertible to date "yyyy-MM-dd" or "yyyy-MM-dd HH:mm:ss.ssss"
Example:
Subtract "1 day" from eventDateTime.
If the number of days to be subtracted is negative, this feature adds the given number of days to the actual date.
Ss= "dp-sql" > ss= "alt" > var df = Seq (ss= "> (1, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-01 23:00:01 ")), ss=" alt "> (2, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-02 12:40:32 "), ss=" > (3) Ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-03 09:54:00"), ss= "alt" > (4, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-04 10:12:43") ss= ">) ss=" alt "> .toDF (ss=" string ">" typeId ", ss=" string ">" eventDateTime ") ss=" > ss= "alt > df.withColumn (ss=" string ">" Adjusted_Date ") Date_sub ($ss= "string" > "eventDateTime", 1)) .show
Date_add
Same as date_sub, but days are added to the actual number of days.
Example:
Add "1 day" to eventDateTime
Ss= "dp-sql" > ss= "alt" > var df = Seq (ss= "> (1, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-01 23:00:01 ")), ss=" alt "> (2, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-02 12:40:32 "), ss=" > (3) Ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-03 09:54:00"), ss= "alt" > (4, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-04 10:12:43") ss= ">) ss=" alt "> .toDF (ss=" string ">" Id ", ss=" string ">" eventDateTime ") ss=" > df ss= "alt > .withColumn (ss=" string ">" Adjusted Date ") Date_add ($ss= "string" > "eventDateTime", 1)) ss= "> .show ()
Months_add
Like date_add and date_sub, this feature helps to add months.
Subtract months, the number of months to be subtracted is set to negative, because there is no separate subtraction function to subtract months
Example:
Add and subtract one month from eventDateTime.
Ss= "dp-sql" > ss= "alt" > var df = Seq (ss= "> (1, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-01 23:00:01 ")), ss=" alt "> (2, ss=" keyword "> Timestamp.valueOf (ss=" string ">" 2020-01-02 12:40:32 "), ss=" > (3, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-03 09:54:00")) Ss= "alt" > (4, ss= "keyword" > Timestamp.valueOf (ss= "string" > "2020-01-04 10:12:43") ss= ">) .toDF (ss=" string ">" typeId ", ss=" string ">" eventDateTime ") ss=" alt "> / / ss=" keyword "> To ss=" keyword "> add one months ss=" > df ss= "alt" > .withColumn (ss= "string" > "Adjusted Date", add_months ($ss= "string" > "eventDateTime") 1) ss= "" > .show () ss= "alt" > / / ss= "keyword" > To subtract one months ss= "> df ss=" alt "> .withcolumn (ss=" string ">" Adjusted Date ", add_months ($ss=" string ">" eventDateTime ",-1)) ss=" > .show ()
Zip_with
Merge left and right arrays by applying functions.
This function expects both arrays to be the same length, and if one array is shorter than the other, null is added to match the longer array length.
Example:
Add and merge the contents of two arrays into one
Ss= "dp-sql" > ss= "alt" > val df = Seq ss= "> .toDF (ss=" string ">" array_1 ", ss=" string ">" array_2 ") ss=" alt "> ss=" > df ss= "alt > .withColumn (ss=" string ">" merged_array ", zip_with ($ss=" string ">" array_1 ", $ss=" string ">" array_2 ", x Y) = > (xroomy)) ss= "" > .show
Apply the predicate to all elements and check whether at least one or more elements in the array are true for the predicate function.
Example:
Check whether at least one element in the array is even.
Ss= "dp-sql" > ss= "alt" > val df= Seq (Seq (2jung 4pens 6), Seq (5pje 10pas 3) .toDF (ss= "string" > "num_array") ss= "> df.withColumn (ss=" string ">" flag ", exists ($ss=" string ">" num_array ", x = > lit (x% 2pas)) ss=" alt ">. Show
Filter
Applies the given predicate to all elements in the array and filters out elements whose predicate is true.
Example:
Filter out only the even elements in the array.
Ss= "dp-sql" > ss= "alt" > val df = Seq (Seq (2pje 4pc6), Seq (5pje 10pas 3) .toDF (ss= "string" > "num_array") ss= "> df.withColumn (ss=" string ">" even_array ", filter ($ss=" string ">" num_array ", x = > lit (x% 2fetch 0)) ss=" alt "> show.
Aggregate aggregate
Use the given function to simplify the given array and another value / state to a single value, and apply the optional finish function to convert the reduced value to another state / value.
Example:
Add 10 to the sum of the array and multiply the result by 2
Ss= "dp-sql" > ss= "alt" > val df = Seq ((Seq (2pime 4pyr6), 3), (Seq (5page10pence3), 8)) ss= "> .toDF (ss=" string ">" num_array ", ss=" string ">" constant ") ss=" alt "> df.withColumn (ss=" string ">" reduced_array ", aggregate ($ss=" string ">" num_array ", $ss=" string ">" constant ", (xMagney) = > xonomy) X = > Spark SQL 2)) features introduced for Spark SQL mode in ss= "" > .showSpark 3.0
Here are the new SQL functions that you can only use in Spark SQL mode.
Acosh
Finds the reciprocal of the hyperbolic cosine of a given expression.
Asinh
Find the inverse of the hyperbolic sine of a given expression.
Atanh
Finds the inverse of the hyperbolic tangent of a given expression.
Bit_and,bit_or and bit_xor
Calculate bitwise AND,OR and XOR values
Bit_count
Returns the number of bits of the count.
Bool_and and bool_or
Verify that all values of the expression are true or that at least one of the expressions is true.
Count_if
Returns the number of true values in a column
Example:
Find out the even values in a given column
Ss= "dp-sql" > ss= "alt" > var df = Seq ((1), (2), (4)) .toDF (ss= "string" > "num") ss= "> ss=" alt "> df.createOrReplaceTempView (ss=" string ">" table ") ss=" > spark.sql (ss= "string" > select count_if (num% 2 cycles 0) from table ") show.
Date_part
Extract part of the date / time stamp, such as hours, minutes, etc.
Div
Used to separate an expression or a column with another expression / column
Every and sum
If the given expression evaluates to true for all column values of each column, and at least one value evaluates to true for some values, this function returns true.
Make_date,make_interval and make_timestamp
Construct date, timestamp and specific interval.
Example:
Ss= "dp-sql" > ss= "alt" > ss= "keyword" > SELECT make_timestamp (2020, 01, 7, 30, 45.887)
Max_by and min_by
Compare two columns and return the value of the left column associated with the maximum / minimum value of the right column
Example:
Ss= "dp-sql" > ss= "alt" > var df = Seq ((1p1), (2p1), (4p3)) .toDF (ss= "string" > "x", ss= "string" > "y") ss= "> ss=" alt "> df.createOrReplaceTempView (ss=" string ">" table ") ss=" > spark.sql (ss= "string" > "select max_by (xQuery) from table") show.
Types
Returns the data type of the column value
Edition
Returns the Spark version and its git version
Justify_days,justify_hours and justify_interval
The newly introduced alignment feature is used to adjust the time interval.
Example:
30 days is a month
Ss= "dp-sql" > ss= "alt" > ss= "keyword" > SELECT justify_days (interval ss= "string" >'30 day')
Partition conversion function
Starting with Spark 3.0 and later, there are new features that help you partition data, which I'll cover in another article.
Overall, we have analyzed all the data conversion and analysis capabilities, which are the sparks of version 3.0. I hope this guide will help you understand these new features. These features will certainly accelerate spark development and help build solid and effective spark pipes.
After reading the above, do you have any further understanding of the new features of Spark 3.0? If you want to know more knowledge or related content, please follow the industry information channel, thank you for your support.
Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.
Views: 0
*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.
Continue with the installation of the previous hadoop.First, install zookooper1. Decompress zookoope
"Every 5-10 years, there's a rare product, a really special, very unusual product that's the most un
© 2024 shulou.com SLNews company. All rights reserved.