
How to Pass Variable-Length Parameters to a Spark UDF

2025-01-28 Update From: SLTechnology News&Howtos


This article explains how to pass multiple columns to a Spark UDF, starting from variable-length parameters and covering the workarounds Spark requires.

Introduction

Variable-length parameters are nothing new to us. In Java we write

public void varArgs(String... args)

and in Scala we write

def varArgs(cols: String*): String
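As a quick runnable refresher (the `varArgs` name mirrors the signature above): inside the method body a varargs parameter is received as a Seq, and an existing Seq can be expanded into a varargs call with `: _*`. This Seq relationship is exactly what Spark relies on below. A minimal sketch:

```scala
// A Scala varargs method; inside the body, cols is a Seq[String].
def varArgs(cols: String*): String = cols.mkString("|")

val direct = varArgs("a", "b", "c")            // call with individual values
val spread = varArgs(Seq("a", "b", "c"): _*)   // expand an existing Seq with ": _*"
```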

In Spark we often have business logic that the built-in functions cannot express, for example merging several columns of the same row into one column with our own logic. Variable-length parameters and their variants can help here.

However, a Spark UDF cannot accept a variable-length argument list directly. This article still starts from varargs because the requirement originates there; with a small transformation we can receive the values through a varargs or Seq parameter after all.

The following is demonstrated in spark-shell. Three kinds of parameters can be used to pass multiple columns:

A varargs parameter (receives an array-type column)

A Seq parameter (receives an array-type column)

A Row parameter (receives a struct-type column)

A UDF with a varargs parameter

Define the UDF method

def myConcatVarargs(sep: String, cols: String*): String = cols.filter(_ != null).mkString(sep)

Register the UDF function

Since varargs can only appear in a method definition, we eta-expand the method into a function value before registering it:

val myConcatVarargsUDF = udf(myConcatVarargs _)

You can see that the definition of the UDF is as follows

UserDefinedFunction(&lt;function2&gt;,StringType,List(StringType, ArrayType(StringType,true)))

That is, the varargs parameter has been converted to ArrayType, and the function now takes only two parameters. From this alone it is already clear that a variable-length argument list cannot be used at the call site.
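The conversion above can be reproduced in plain Scala: eta-expanding a varargs method with a trailing underscore yields a two-argument function whose last parameter is a Seq, which is exactly why Spark registers it with an ArrayType argument. A minimal sketch (`myConcat` is an illustrative name, not the UDF above):

```scala
// A varargs method: the repeated parameter is a Seq inside the body.
def myConcat(sep: String, cols: String*): String =
  cols.filter(_ != null).mkString(sep)

// Eta-expansion: the varargs method becomes a function value of type
// (String, Seq[String]) => String -- two parameters, not "one plus N".
val asFunction = myConcat _

// The function must now be called with an explicit Seq.
val result = asFunction("-", Seq("aa", "bb", "cc"))
```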

Passing a variable-length argument list

We construct a DataFrame as follows

val df = sc.parallelize(Array(("aa", "bb", "cc"), ("dd", "ee", "ff"))).toDF("A", "B", "C")

Then pass multiple String columns directly to myConcatVarargsUDF:

df.select(myConcatVarargsUDF(lit("-"), col("A"), col("B"), col("C"))).show

This fails with the following error:

java.lang.ClassCastException: anonfun$1 cannot be cast to scala.Function4

So Spark does not support a variable-length argument list here: the call site treats the UDF as a four-parameter function, while it was registered as a two-parameter function.

Workaround: wrap the columns with array() as the second argument

We use the array() function provided by Spark to convert the columns into a single Array-typed column:

df.select(myConcatVarargsUDF(lit("-"), array(col("A"), col("B"), col("C")))).show

The results are as follows

+-------------------+
|UDF(-,array(A,B,C))|
+-------------------+
|           aa-bb-cc|
|           dd-ee-ff|
+-------------------+

So a UDF built from a varargs method can merge multiple columns, as long as the columns are first packed into an Array with array().

A UDF with a Seq parameter

As noted above, the varargs parameter was converted to ArrayType, which raises the question: why not declare the parameter as Array or List directly?

In fact, UDF parameter types cannot be chosen freely. List, Array, and arbitrary custom types are not supported, because Spark must be able to serialize and deserialize the column data into the parameter type.

Errors with Array/List as the parameter type

Here is an example of the Array type

Define the function

val myConcatArray = (cols: Array[String], sep: String) => cols.filter(_ != null).mkString(sep)

Register the UDF

val myConcatArrayUDF = udf(myConcatArray)

You can see that the UDF signature given is

UserDefinedFunction(&lt;function2&gt;,StringType,List())

Apply the UDF

df.select(myConcatArrayUDF(array(col("A"), col("B"), col("C")), lit("-"))).show

This fails with:

scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;

List as a parameter type fails in the same way: the objects cannot be constructed during deserialization. So neither List nor Array can be used directly as a UDF parameter type.
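The cast failure can be reproduced without Spark. For an ArrayType column, Spark hands the UDF a Seq-backed wrapper (WrappedArray in Scala 2.12, ArraySeq in 2.13), and such a Seq instance cannot be cast to Array[String]. A pure-Scala sketch of the same failure mode (a plain Seq stands in for Spark's wrapper):

```scala
import scala.util.Try

// Stand-in for what Spark passes for an ArrayType column: some Seq
// implementation, not a plain Array.
val decoded: Seq[String] = Seq("aa", "bb", "cc")

// Casting the Seq to Array[String] throws ClassCastException at
// runtime -- the same failure as the WrappedArray error above.
val castFails = Try(decoded.asInstanceOf[Array[String]]).isFailure

// Declaring the parameter as Seq works for any Seq implementation.
val joined = decoded.filter(_ != null).mkString("-")
```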

Use Seq as the parameter type

The definition and call are as follows

val myConcatSeq = (cols: Seq[Any], sep: String) => cols.filter(_ != null).mkString(sep)
val myConcatSeqUDF = udf(myConcatSeq)
df.select(myConcatSeqUDF(array(col("A"), col("B"), col("C")), lit("-"))).show

The results are as follows

+-------------+
|UDF(array, -)|
+-------------+
|     aa-bb-cc|
|     dd-ee-ff|
+-------------+

A UDF with a Row parameter

We can use the struct function from Spark's functions package to build a struct-typed column, and declare the UDF parameter as Row to receive multiple columns through it.

def myConcatRow: ((Row, String) => String) = (row, sep) => row.toSeq.filter(_ != null).mkString(sep)
val myConcatRowUDF = udf(myConcatRow)
df.select(myConcatRowUDF(struct(col("A"), col("B"), col("C")), lit("-"))).show

You can see that the signature of UDF is as follows

UserDefinedFunction(&lt;function2&gt;,StringType,List())

The results are as follows

+--------------+
|UDF(struct, -)|
+--------------+
|      aa-bb-cc|
|      dd-ee-ff|
+--------------+

With the Row type you can also use pattern matching to extract fields, which is even more convenient:

row match { case Row(aa: String, bb: Int) => ... }

Summary

Of the three approaches, the varargs and Seq parameters require the columns to be packed into an ArrayType with the array() function, while the Row parameter requires struct() to build a struct-typed column; in both cases this exists so that Spark can serialize and deserialize the data. Among the three, the Row approach is the most flexible and robust: it supports columns of different types and allows explicit pattern extraction, which is quite convenient.

We have also seen that UDFs do not accept List or Array parameters, and a custom type cannot be used as a UDF parameter type unless Spark knows how to serialize and deserialize it. Seq, however, works, and can receive multiple columns packed into an array.

In addition, currying can be used to pass values for multiple columns, but a separate UDF must be defined for each number of parameters.
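The currying idea can be sketched in plain Scala (`concat3` and `dashConcat3` are illustrative names, not from the original): the separator is fixed in the first parameter list, but the column count is fixed by the signature, so a four-column variant would need its own definition, which is the drawback noted above.

```scala
// Curried concat: fix the separator in the first parameter list,
// take a fixed number of columns in the second.
def concat3(sep: String)(a: String, b: String, c: String): String =
  Seq(a, b, c).filter(_ != null).mkString(sep)

// Partially apply the separator; the result is a three-argument
// function value that could be registered with udf() in a real
// Spark session.
val dashConcat3 = concat3("-") _

val joined3 = dashConcat3("aa", "bb", "cc")
```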

This concludes the discussion of passing variable-length parameters to Spark UDFs. Pairing the explanations with hands-on practice in spark-shell is the best way to consolidate them.
