2025-04-02 Update From: SLTechnology News&Howtos > Servers
Shulou(Shulou.com)05/31 Report--
This article explains how to use UDF, UDAF, and UDTF in Hive. Many readers are not very familiar with these user-defined functions, so it is shared here for reference; I hope you learn a lot from it.
I. UDF
1. Background: Hive is a data warehouse on Hadoop that provides HQL queries on top of MapReduce. Hive is a very open system, and many parts of it can be customized by users, including:
A) File formats: Text File, Sequence File
B) In-memory data formats: Java Integer/String, Hadoop IntWritable/Text
C) User-supplied map/reduce scripts: data is passed via stdin/stdout, in any language
D) User-defined functions: Substr, Trim, ... (1 row in, 1 row out)
E) User-defined aggregate functions: Sum, Average, ... (N rows in, 1 row out)
2. Definition: a UDF (User-Defined Function) is a function defined by the user to process data.
II. Usage
1. A UDF can be applied directly in a SELECT statement to format the query result before the content is output.
2. When writing a UDF, you need to pay attention to the following points:
A) A custom UDF needs to extend org.apache.hadoop.hive.ql.exec.UDF.
B) It needs to implement the evaluate function.
C) The evaluate function supports overloading.
3. The following is a UDF that sums numbers. The evaluate overloads add two Integer values, two Double values, and a variable-length list of Integer values.
package hive.connect;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class Add extends UDF {
    public Integer evaluate(Integer a, Integer b) {
        if (null == a || null == b) {
            return null;
        }
        return a + b;
    }

    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) {
            return null;
        }
        return a + b;
    }

    public Integer evaluate(Integer... a) {
        int total = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] != null) {
                total += a[i];
            }
        }
        return total;
    }
}

4. Steps
A) Package the program into a jar and copy it to the target machine.
B) Enter the hive client and add the jar package: hive> add jar /run/jar/udf_test.jar;
C) Create a temporary function: hive> CREATE TEMPORARY FUNCTION add_example AS 'hive.connect.Add';
D) query HQL statement:
SELECT add_example(8, 9) FROM scores;
SELECT add_example(scores.math, scores.art) FROM scores;
SELECT add_example(6, 7, 8, 6.8) FROM scores;
E) Destroy the temporary function: hive> DROP TEMPORARY FUNCTION add_example;
5. When a UDF is used, implicit type conversion is performed automatically, for example:
SELECT add_example(8, 9.1) FROM scores;
The result is 17.1: the Int parameter is converted to Double before the Double overload is invoked. Implicit type conversion is controlled by UDFResolver.
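To see the overload behavior described above without a Hive installation, the evaluate methods of the Add class can be exercised as plain Java. AddDemo is a hypothetical standalone class for illustration only; a real UDF must extend org.apache.hadoop.hive.ql.exec.UDF, and the Int-to-Double promotion shown in the SELECT above is performed by Hive's UDFResolver, not by Java, so this sketch calls each overload with already-matching types:

```java
public class AddDemo {
    // evaluate overloads copied from the Add UDF above, minus the Hive base class
    public Integer evaluate(Integer a, Integer b) {
        if (a == null || b == null) return null;
        return a + b;
    }

    public Double evaluate(Double a, Double b) {
        if (a == null || b == null) return null;
        return a + b;
    }

    public Integer evaluate(Integer... a) {
        int total = 0;
        for (Integer v : a) {
            if (v != null) total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        AddDemo add = new AddDemo();
        System.out.println(add.evaluate(8, 9));     // Integer overload: 17
        System.out.println(add.evaluate(8.0, 9.1)); // Double overload
        System.out.println(add.evaluate(6, 7, 8));  // varargs overload: 21
    }
}
```

Note that `add.evaluate(8, 9.1)` would not even compile in plain Java; it is Hive that rewrites the Int argument to Double before dispatching to the Double overload.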
III. UDAF
1. When querying data in Hive, some aggregate functions are not built into HQL and need to be implemented by users.
2. User-defined aggregate functions: Sum, Average, ... (N rows in, 1 row out)
UDAF (User-Defined Aggregation Function)
IV. Usage
1. The following two packages are required, import org.apache.hadoop.hive.ql.exec.UDAF and org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
2. The function class needs to extend the UDAF class, and its inner class Evaluator must implement the UDAFEvaluator interface.
3. Evaluator needs to implement init, iterate, terminatePartial, merge and terminate functions.
A) The init function implements the init method of the UDAFEvaluator interface and resets the aggregation state.
B) iterate receives the incoming parameter and accumulates it into the internal state. Its return type is boolean.
C) terminatePartial takes no parameters; it returns the partial aggregation state once iterate has finished processing its rows. terminatePartial is similar to Hadoop's Combiner.
D) merge receives the return value of terminatePartial and merges that partial state into the current one. Its return type is boolean.
E) terminate returns the final aggregate function result.
4. The following is an average UDAF:
package hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

public class Avg extends UDAF {
    public static class AvgState {
        private long mCount;
        private double mSum;
    }

    public static class AvgEvaluator implements UDAFEvaluator {
        AvgState state;

        public AvgEvaluator() {
            super();
            state = new AvgState();
            init();
        }

        /* init is similar to a constructor and is used for initialization */
        public void init() {
            state.mSum = 0;
            state.mCount = 0;
        }

        /* iterate receives the passed parameter and accumulates it internally.
           The return type is boolean. */
        public boolean iterate(Double o) {
            if (o != null) {
                state.mSum += o;
                state.mCount++;
            }
            return true;
        }

        /* terminatePartial has no parameters; it returns the partial state
           after iterate has finished. Similar to Hadoop's Combiner. */
        public AvgState terminatePartial() {
            // combiner
            return state.mCount == 0 ? null : state;
        }

        /* merge receives the result of terminatePartial and merges it into
           the current state. The return type is boolean. */
        public boolean merge(AvgState o) {
            if (o != null) {
                state.mCount += o.mCount;
                state.mSum += o.mSum;
            }
            return true;
        }

        /* terminate returns the final aggregate result */
        public Double terminate() {
            return state.mCount == 0 ? null : Double.valueOf(state.mSum / state.mCount);
        }
    }
}

5. Steps to execute the average function
A) compile the java file into Avg_test.jar.
B) Enter the hive client and add the jar package:
hive> add jar /run/jar/Avg_test.jar;
C) Create a temporary function:
hive> create temporary function avg_test as 'hive.udaf.Avg';
D) Query statement:
hive> select avg_test(scores.math) from scores;
E) Destroy the temporary function:
hive> drop temporary function avg_test;
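The evaluator lifecycle described in section IV can be sketched in plain Java by simulating how the engine drives it: each map task calls init and iterate over its split, terminatePartial ships the partial state, and the reduce side calls merge and finally terminate. UdafLifecycleDemo is a hypothetical class that mirrors the Avg evaluator above with the Hive imports stripped; in reality this sequence is driven by Hive, not by a main method:

```java
import java.util.Arrays;
import java.util.List;

public class UdafLifecycleDemo {
    // Mirror of the Avg evaluator above, without the Hive base types,
    // so the lifecycle can be exercised standalone.
    static class AvgState { long mCount; double mSum; }

    static class AvgEvaluator {
        AvgState state = new AvgState();
        void init() { state.mSum = 0; state.mCount = 0; }
        boolean iterate(Double o) {
            if (o != null) { state.mSum += o; state.mCount++; }
            return true;
        }
        AvgState terminatePartial() { return state.mCount == 0 ? null : state; }
        boolean merge(AvgState o) {
            if (o != null) { state.mCount += o.mCount; state.mSum += o.mSum; }
            return true;
        }
        Double terminate() {
            return state.mCount == 0 ? null : state.mSum / state.mCount;
        }
    }

    public static void main(String[] args) {
        // Two map tasks each aggregate their own split...
        List<List<Double>> splits = Arrays.asList(
                Arrays.asList(80.0, 90.0), Arrays.asList(70.0, null));
        AvgEvaluator reducer = new AvgEvaluator();
        reducer.init();
        for (List<Double> split : splits) {
            AvgEvaluator mapper = new AvgEvaluator();
            mapper.init();
            for (Double v : split) mapper.iterate(v);  // per-row accumulation
            reducer.merge(mapper.terminatePartial());  // combiner-style partial state
        }
        // ...and the reduce side produces the final average: (80+90+70)/3
        System.out.println(reducer.terminate());       // prints 80.0
    }
}
```

Null rows are skipped inside iterate, so they do not distort the count, matching the behavior of the Avg class above.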
V. Summary
1. Overload the evaluate function.
2. The parameter types of a UDF can be Writable types or basic Java objects.
3. UDF supports variable-length (varargs) parameters.
4. Hive supports implicit type conversion.
5. When the client exits, the temporary function created is automatically destroyed.
6. The evaluate function must return a typed value. If the result is empty, it must return null; the return type cannot be void.
7. A UDF performs a computation over the columns of a single record, while a UDAF is a user-defined aggregate function that performs a computation over all records of the table.
8. Both UDF and UDAF can be overloaded.
9. Viewing functions:
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
UDTF steps:
1. Must extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF.
2. Implement three methods: initialize, process, and close.
3. A UDTF is driven as follows:
a. The initialize method is called first; it returns information about the rows the UDTF will emit (the number of columns and their types).
b. After initialization, the process method is called to handle the passed parameters; results are emitted through the forward() method.
c. Finally, the close() method is called to release anything that needs cleaning up.
Java code:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;

public class GenericUDTFExplode extends GenericUDTF {
    private ListObjectInspector listOI = null;

    @Override
    public void close() throws HiveException {
    }

    @Override
    public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
        if (args.length != 1) {
            throw new UDFArgumentException("explode() takes only one argument");
        }
        if (args[0].getCategory() != ObjectInspector.Category.LIST) {
            throw new UDFArgumentException("explode() takes an array as a parameter");
        }
        listOI = (ListObjectInspector) args[0];
        ArrayList<String> fieldNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldNames.add("col");
        fieldOIs.add(listOI.getListElementObjectInspector());
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    private final Object[] forwardObj = new Object[1];

    @Override
    public void process(Object[] o) throws HiveException {
        List<?> list = listOI.getList(o[0]);
        if (list == null) {
            return;
        }
        for (Object r : list) {
            forwardObj[0] = r;
            forward(forwardObj);
        }
    }

    @Override
    public String toString() {
        return "explode";
    }
}
10. Wiki link: http://wiki.apache.org/hadoop/Hive/LanguageManual/UDF
Deploying Jars for User Defined Functions and User Defined SerDes
In order to start using your UDF, you first need to add the code to the classpath:
hive> add jar my_jar.jar;
Added my_jar.jar to class path
By default, it will look in the current directory. You can also specify a full path:
hive> add jar /tmp/my_jar.jar;
Added /tmp/my_jar.jar to class path
Your jar will then be on the classpath for all jobs initiated from that session. To see which jars have been added to the classpath you can use:
hive> list jars;
my_jar.jar
See Hive CLI for full syntax and more examples.
As of Hive 0.13, UDFs also have the option of being able to specify required jars in the CREATE FUNCTION statement:
CREATE FUNCTION myfunc AS 'myclass' USING JAR 'hdfs:///path/to/jar';
This will add the jar to the classpath as if ADD JAR had been called on that jar.
That covers the contents of "how to use udf, udaf and udtf". Thank you for reading! I hope this article has helped you understand how these user-defined functions work.