This article looks at how a Spark UDF can receive multiple columns, starting from variable-length (varargs) parameters and working through three approaches that do work in practice.
Introduction
Variable-length (varargs) parameters are nothing new. In Java we write:
public void varArgs(String... args)
and in Scala we write:
def varArgs(cols: String*): String
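As a minimal, self-contained sketch (plain Scala, no Spark involved), here is how such a method is defined and called, including expanding an existing Seq with : _* :

def varArgs(cols: String*): String = cols.mkString(",")

varArgs("a", "b", "c")   // "a,b,c" -- arguments listed one by one
val xs = Seq("a", "b", "c")
varArgs(xs: _*)          // "a,b,c" -- an existing Seq expanded with : _*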
In Spark we often have business logic that the built-in functions cannot express, for example combining several columns of the same row into a single column. Variable-length parameters, and the variants described below, can help with that.
However, a Spark UDF cannot be called with a variable-length argument list directly. This article still starts from varargs because the requirement arises naturally from them, and with a small transformation we can receive the arguments through a varargs or Seq parameter instead.
The following is demonstrated in spark-shell (a sketch of the session setup follows the list below). These three parameter styles can each be used to pass in multiple columns:
Variable-length parameter (receives an ArrayType column)
Seq parameter (receives an ArrayType column)
Row parameter (receives a StructType column)
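The later snippets rely on the usual spark-shell environment. As a sketch, here is roughly the setup they assume (in spark-shell, spark, sc and spark.implicits._ are already provided; the names below are just the common conventions):

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions._   // udf, lit, col, array, struct

val spark = SparkSession.builder().appName("udf-varargs").master("local[*]").getOrCreate()
import spark.implicits._                  // enables .toDF on local collections
val sc = spark.sparkContext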
A UDF with a variable-length parameter
Define the UDF method
def myConcatVarargs(sep: String, cols: String*): String = cols.filter(_ != null).mkString(sep)
Register the UDF function
Since variable-length parameters can only appear in method definitions, we eta-expand the method into a function (the trailing underscore) when registering it.
val myConcatVarargsUDF = udf(myConcatVarargs _)
You can see that the definition of the UDF is as follows
UserDefinedFunction(<function2>, StringType, List(StringType, ArrayType(StringType, true)))
That is, the varargs parameter has been converted to ArrayType, and the registered function takes only two parameters. The signature alone already suggests that a variable-length argument list cannot be passed in directly.
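To make the eta-expansion explicit, here is a small sketch of what the trailing underscore does (the commented-out line shows why a function literal cannot be used: varargs are legal only in method definitions):

// val bad = (sep: String, cols: String*) => cols.mkString(sep)   // does not compile
// Eta-expansion turns the method into a plain Function2 whose second
// parameter is Seq[String] -- exactly the two parameters udf() reports:
val asFunction: (String, Seq[String]) => String = myConcatVarargs _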
Passing a variable-length argument list
We construct a DataFrame as follows
val df = sc.parallelize(Array(("aa", "bb", "cc"), ("dd", "ee", "ff"))).toDF("A", "B", "C")
Then pass in multiple columns of type String directly to myConcatVarargsUDF
df.select(myConcatVarargsUDF(lit("-"), col("A"), col("B"), col("C"))).show
The result is an error as follows
java.lang.ClassCastException: anonfun$1 cannot be cast to scala.Function4
So Spark does not support a variable-length argument list here: the four-argument call is treated as a Function4, while our UDF was registered as a function of two parameters, not four.
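If you really want to list the columns one by one, an alternative (a hedged sketch, not part of the original approach) is to register a UDF with a fixed arity, so that Spark really does see a matching Function4:

// Sketch: a fixed-arity UDF that accepts exactly three columns plus a separator.
val myConcat3UDF = udf((sep: String, a: String, b: String, c: String) =>
  Seq(a, b, c).filter(_ != null).mkString(sep))

df.select(myConcat3UDF(lit("-"), col("A"), col("B"), col("C"))).show

The obvious drawback is that a separate definition is needed for every number of columns, which is what the array() workaround below avoids.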
Workaround: pack the columns with array() into the second argument
We use the array() function provided by Spark to pack the columns into a single Array-typed column:
df.select(myConcatVarargsUDF(lit("-"), array(col("A"), col("B"), col("C")))).show
The results are as follows
+-------------------+
|UDF(-,array(A,B,C))|
+-------------------+
|           aa-bb-cc|
|           dd-ee-ff|
+-------------------+
So a UDF built from a varargs method can receive multiple columns, provided they are first packed into one column with array().
A UDF with a Seq parameter
As noted above, the varargs parameter is converted to ArrayType, which raises an obvious question: why not use an Array or List parameter directly?
In fact, UDF parameter types cannot be chosen at will: List and Array do not work, and neither do arbitrary custom types, because the values must survive Spark's serialization and deserialization.
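To see what Spark actually hands the UDF at runtime, here is a small probe (a sketch; the exact runtime class can vary by Spark version, on the 2.x-era shells this article uses it is typically scala.collection.mutable.WrappedArray):

// Sketch: inspect the runtime type Spark passes for an ArrayType column.
// It is a Seq implementation, which is why Seq works as the parameter type
// while Array and List do not.
val inspectUDF = udf((cols: Seq[Any]) => cols.getClass.getName)
df.select(inspectUDF(array(col("A"), col("B"), col("C")))).show(false)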
Errors, taking Array and List as an example
Here is an example with the Array type.
Define the function:
val myConcatArray = (cols: Array[String], sep: String) => cols.filter(_ != null).mkString(sep)
Register the UDF
val myConcatArrayUDF = udf(myConcatArray)
You can see that the UDF signature given is
UserDefinedFunction(<function2>, StringType, List())
Apply UDF
df.select(myConcatArrayUDF(array(col("A"), col("B"), col("C")), lit("-"))).show
This fails with the following error:
scala.collection.mutable.WrappedArray$ofRef cannot be cast to [Ljava.lang.String;
List as a parameter type fails in the same way, because the concrete object cannot be rebuilt during deserialization. So neither List nor Array can be used directly as a UDF parameter type.
Use Seq as the parameter type
The definition and call are as follows:
val myConcatSeq = (cols: Seq[Any], sep: String) => cols.filter(_ != null).mkString(sep)
val myConcatSeqUDF = udf(myConcatSeq)
df.select(myConcatSeqUDF(array(col("A"), col("B"), col("C")), lit("-"))).show
The results are as follows
+-------------+
|UDF(array, -)|
+-------------+
|     aa-bb-cc|
|     dd-ee-ff|
+-------------+
A UDF with a Row parameter
We can use the struct function from Spark's functions package to build a struct-typed column, and declare the UDF parameter as Row, again achieving multi-column input.
def myConcatRow: ((Row, String) => String) = (row, sep) => row.toSeq.filter(_ != null).mkString(sep)
val myConcatRowUDF = udf(myConcatRow)
df.select(myConcatRowUDF(struct(col("A"), col("B"), col("C")), lit("-"))).show
You can see that the signature of UDF is as follows
UserDefinedFunction(<function2>, StringType, List())
The results are as follows
+--------------+
|UDF(struct, -)|
+--------------+
|      aa-bb-cc|
|      dd-ee-ff|
+--------------+
With the Row type you can also use pattern extraction, which makes the UDF body even more convenient:
row match {
  case Row(aa: String, bb: Int) => // ...
}
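Putting this together, here is a sketch of a Row-typed UDF that extracts the struct's fields by pattern matching (the fallback branch is only for illustration; with the df above all three fields are Strings):

// Sketch: pattern-extract the struct's fields inside the UDF body.
val describeUDF = udf((row: Row) => row match {
  case Row(a: String, b: String, c: String) => s"$a/$b/$c"
  case _                                    => "unexpected shape"
})

df.select(describeUDF(struct(col("A"), col("B"), col("C")))).show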
Summary
Of the three approaches above, the varargs and Seq parameters require the columns to be packed with the array function into an ArrayType column, while the Row parameter requires the struct function to build a StructType column; in every case this serves data serialization and deserialization. Of the three, Row is the most flexible and reliable: it supports fields of different types and allows explicit pattern extraction, which is quite convenient.
We have also seen that UDFs do not support List or Array parameter types, and that custom types which do not take part in Spark's serialization and deserialization cannot be used as parameter types either. Seq, however, works fine and can receive multiple columns packed into an array.
In addition, we can use currying to pass in multiple columns, but then a separate UDF has to be defined for each number of parameters.
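As a sketch of that currying idea (the helper name concatWith is made up for illustration), the separator is fixed first and the resulting UDF is then applied to a fixed number of columns:

// Sketch: curry the separator in; one definition still only covers one arity.
def concatWith(sep: String) = udf((a: String, b: String) =>
  Seq(a, b).filter(_ != null).mkString(sep))

df.select(concatWith("-")(col("A"), col("B"))).show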