
How to Write and Use Custom Functions in Spark and Hive

Updated: 2025-02-21. Source: SLTechnology News&Howtos (Shulou)


This article introduces how to write and use custom functions (UDF, UDAF, and UDTF) with Spark and Hive, and outlines how Spark registers and loads them internally.

1. Brief introduction

Spark currently supports three kinds of custom functions: UDF, UDTF, and UDAF.

UDF (user-defined function): takes one row and returns one result, a one-to-one mapping. For example, a function that takes an IP address and returns the corresponding province.

UDTF (user-defined table-generating function): takes one row and returns multiple rows, a one-to-many mapping. This is a Hive concept; Spark SQL has no native UDTF interface of its own, but flatMap in Spark achieves the same effect.

UDAF (user-defined aggregate function): takes multiple rows and returns one row. It is used for aggregation, typically together with groupBy, in the same way as the built-in aggregates count and sum, but it is more complex to implement than a UDF.
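To make the three shapes concrete, here is a minimal plain-Java sketch that uses the standard streams API instead of Spark or Hive. The class and method names (UdfKindsDemo, lengths, explodeWords, totalLength) are illustrative assumptions and are not part of either project.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class UdfKindsDemo {
    // UDF shape: one input row produces exactly one output value.
    static List<Integer> lengths(List<String> rows) {
        return rows.stream().map(String::length).collect(Collectors.toList());
    }

    // UDTF shape: one input row produces zero or more output rows (flatMap).
    static List<String> explodeWords(List<String> rows) {
        return rows.stream()
                .flatMap(row -> Arrays.stream(row.split(" ")))
                .collect(Collectors.toList());
    }

    // UDAF shape: many input rows are reduced to a single value.
    static int totalLength(List<String> rows) {
        return rows.stream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a b", "c d e");
        System.out.println(lengths(rows));      // one value per row
        System.out.println(explodeWords(rows)); // more rows than the input
        System.out.println(totalLength(rows));  // one value for all rows
    }
}
```

The same distinction carries over to Spark: map corresponds to the UDF shape, flatMap to the UDTF shape, and a reduction after grouping to the UDAF shape.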

Under the hood, Spark describes a function with a CatalogFunction structure. Its FunctionIdentifier holds basic information such as the function name, and its FunctionResource records the resource type (jar or file) and the resource path. Spark's SessionCatalog provides a set of APIs for registering, dropping, and looking up functions. When Spark receives a SQL statement that calls a function, it looks up the cached CatalogFunction, reads the jar location and class name from it, has the JVM load the jar dynamically, and invokes the function through reflection on the class name.

Figure 1. CatalogFunction structure

Figure 2. Function registration and loading logic
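The register-then-load-by-reflection flow described above can be sketched in plain Java. This is a simplified model, not Spark's real API: the classes below only borrow the names CatalogFunction and FunctionResource, while SimpleFunctionCatalog, UpperCaseFunction, and the jar path used in the usage note are hypothetical. The sketch also skips the jar-loading step and assumes the function class is already on the classpath.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for Spark's FunctionResource: where the code lives.
final class FunctionResource {
    final String resourceType; // e.g. "jar" or "file"
    final String uri;          // path to the resource

    FunctionResource(String resourceType, String uri) {
        this.resourceType = resourceType;
        this.uri = uri;
    }
}

// Simplified stand-in for Spark's CatalogFunction.
final class CatalogFunction {
    final String name;               // plays the role of FunctionIdentifier
    final String className;          // class that implements the function
    final FunctionResource resource; // where that class would be loaded from

    CatalogFunction(String name, String className, FunctionResource resource) {
        this.name = name;
        this.className = className;
        this.resource = resource;
    }
}

// Example function class; in a real deployment it would be packaged in the jar.
class UpperCaseFunction {
    public String evaluate(String input) {
        return input == null ? null : input.toUpperCase();
    }
}

// Hypothetical catalog offering register, drop, and call-by-name.
class SimpleFunctionCatalog {
    private final Map<String, CatalogFunction> functions = new HashMap<>();

    void register(CatalogFunction function) {
        functions.put(function.name, function);
    }

    boolean drop(String name) {
        return functions.remove(name) != null;
    }

    Object call(String name, String argument) {
        CatalogFunction function = functions.get(name);
        if (function == null) {
            throw new IllegalArgumentException("Undefined function: " + name);
        }
        try {
            // Real Spark would first add function.resource to a class loader;
            // this sketch assumes the class is already on the classpath.
            Class<?> clazz = Class.forName(function.className);
            Object instance = clazz.getDeclaredConstructor().newInstance();
            Method evaluate = clazz.getMethod("evaluate", String.class);
            return evaluate.invoke(instance, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot invoke " + name, e);
        }
    }
}
```

For example, after registering `new CatalogFunction("to_upper", UpperCaseFunction.class.getName(), new FunctionResource("jar", "/path/to/udfs.jar"))`, calling `catalog.call("to_upper", "spark")` instantiates the class by name and invokes its evaluate method.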

Hive support is provided by HiveSessionCatalog, which extends Spark's SessionCatalog. It decorates Spark's basic function handling to accommodate Hive's function types: HiveSimpleUDF wraps UDF, HiveGenericUDF wraps GenericUDF, HiveUDAFFunction wraps AbstractGenericUDAFResolver and UDAF, and HiveGenericUDTF wraps GenericUDTF.

Figure 3. How Hive decorates Spark's function logic
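The mapping from Hive base class to Spark wrapper can be pictured as a chain of type checks. The sketch below is a plain-Java model under stated assumptions: the five abstract classes are empty stand-ins for the Hive classes of the same name, and HiveWrapperChooser, IpToProvince, and SplitWords are hypothetical names. It returns the wrapper name as a string and does not build a real Spark expression.

```java
// Empty stand-ins for Hive's base classes (the real ones live in hive-exec).
abstract class UDF {}
abstract class GenericUDF {}
abstract class AbstractGenericUDAFResolver {}
abstract class UDAF {}
abstract class GenericUDTF {}

// Hypothetical user functions, each extending one of the base classes.
class IpToProvince extends UDF {}
class SplitWords extends GenericUDTF {}

class HiveWrapperChooser {
    // Picks the Spark-side wrapper from the Hive base class a function extends.
    static String wrapperFor(Class<?> clazz) {
        if (UDF.class.isAssignableFrom(clazz)) {
            return "HiveSimpleUDF";
        }
        if (GenericUDF.class.isAssignableFrom(clazz)) {
            return "HiveGenericUDF";
        }
        if (AbstractGenericUDAFResolver.class.isAssignableFrom(clazz)
                || UDAF.class.isAssignableFrom(clazz)) {
            return "HiveUDAFFunction";
        }
        if (GenericUDTF.class.isAssignableFrom(clazz)) {
            return "HiveGenericUDTF";
        }
        throw new IllegalArgumentException("No Hive wrapper for " + clazz.getName());
    }

    public static void main(String[] args) {
        System.out.println(wrapperFor(IpToProvince.class));
        System.out.println(wrapperFor(SplitWords.class));
    }
}
```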

2. UDF

The UDF is the most commonly used kind of custom function and is relatively easy to write. There are two variants: for simple data types, extend the UDF class; for complex data types such as Map, List, and Struct, extend the GenericUDF class.

When implementing a simple-type UDF, you can define one or more methods named evaluate, each with whatever parameters and return type you need. By default, the UDF class uses DefaultUDFMethodResolver as its method resolver. The resolver uses reflection to find the evaluate method whose signature matches the argument types passed in the call. Users can also supply their own resolver to customize how the method is chosen.

Figure 4. Simple example of customizing UDF

Figure 5. Default UDF method parser
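The idea of resolving an evaluate overload by reflection can be sketched without any Hive dependency. The code below is a simplified model, not Hive's DefaultUDFMethodResolver: TruncateUdf and EvaluateResolver are hypothetical names, and the matching rule (same argument count, each argument an instance of the parameter type) is an assumption that is much simpler than Hive's type conversion rules.

```java
import java.lang.reflect.Method;

// Stand-in for a simple UDF with two overloads of evaluate.
// Parameters use wrapper types because the matcher below relies on isInstance.
class TruncateUdf {
    public String evaluate(String text) {
        return evaluate(text, 3);
    }

    public String evaluate(String text, Integer maxLength) {
        if (text == null) {
            return null;
        }
        return text.substring(0, Math.min(maxLength, text.length()));
    }
}

class EvaluateResolver {
    // Finds a public method named "evaluate" whose parameters accept the arguments.
    static Method resolve(Class<?> udfClass, Object... args) {
        for (Method method : udfClass.getMethods()) {
            if (!method.getName().equals("evaluate")) {
                continue;
            }
            Class<?>[] params = method.getParameterTypes();
            if (params.length != args.length) {
                continue;
            }
            boolean matches = true;
            for (int i = 0; i < params.length; i++) {
                if (args[i] != null && !params[i].isInstance(args[i])) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                return method;
            }
        }
        throw new IllegalArgumentException("No matching evaluate method");
    }

    // Resolves the overload and invokes it on the given UDF instance.
    static Object call(Object udf, Object... args) {
        try {
            return resolve(udf.getClass(), args).invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("evaluate failed", e);
        }
    }
}
```

With this model, `EvaluateResolver.call(new TruncateUdf(), "sparkhive")` picks the one-argument overload, and `EvaluateResolver.call(new TruncateUdf(), "sparkhive", 5)` picks the two-argument one.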

3. UDAF

A UDAF is an aggregate function. There are currently three main ways to implement one. The first is to implement Hive's UDAF interface, which is deprecated. The second is to extend Spark's UserDefinedAggregateFunction, which is generally used to aggregate data in stages (note that this class has itself been deprecated since Spark 3.0 in favor of Aggregator). The third is to extend AbstractGenericUDAFResolver, which is slightly more complex than UserDefinedAggregateFunction because it also requires an evaluator, typically a subclass of GenericUDAFEvaluator. With this approach, the UDAF's processing logic lives mainly in the evaluator.

A UserDefinedAggregateFunction declares the input and output data structures and implements four steps: initializing the aggregation buffer (initialize), folding a single input row into the buffer (update), combining two buffers (merge), and computing the final result (evaluate).

Figure 6. Simple example of customizing UDAF
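The initialize, update, merge, and evaluate lifecycle can be illustrated with a plain-Java average. This is a sketch of the pattern only and does not use Spark's UserDefinedAggregateFunction API: AvgBuffer, AverageAggregator, and the run helper that simulates partitions are hypothetical names.

```java
import java.util.List;

// Aggregation buffer: the intermediate state carried between steps.
final class AvgBuffer {
    long count;
    double sum;
}

class AverageAggregator {
    // initialize: create an empty buffer.
    AvgBuffer initialize() {
        return new AvgBuffer();
    }

    // update: fold one input value into a buffer.
    void update(AvgBuffer buffer, double value) {
        buffer.count += 1;
        buffer.sum += value;
    }

    // merge: combine a partial buffer (e.g. from another partition) into the target.
    void merge(AvgBuffer target, AvgBuffer other) {
        target.count += other.count;
        target.sum += other.sum;
    }

    // evaluate: compute the final result; null when there was no input.
    Double evaluate(AvgBuffer buffer) {
        return buffer.count == 0 ? null : buffer.sum / buffer.count;
    }

    // Simulates staged aggregation: one partial buffer per partition, then a merge.
    static Double run(List<List<Double>> partitions) {
        AverageAggregator aggregator = new AverageAggregator();
        AvgBuffer total = aggregator.initialize();
        for (List<Double> partition : partitions) {
            AvgBuffer partial = aggregator.initialize();
            for (double value : partition) {
                aggregator.update(partial, value);
            }
            aggregator.merge(total, partial);
        }
        return aggregator.evaluate(total);
    }
}
```

Keeping a count and a sum in the buffer, instead of a running average, is what makes merge correct: two partial averages cannot be combined without knowing how many rows each one covers.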

4. UDTF

Put simply, a UDTF is a function that generates multiple rows from each input row. It can produce multiple rows and multiple columns, which is why it is also called a table-generating function. It is currently implemented by extending GenericUDTF and implementing two methods: initialize, which validates the parameters and defines the output columns, and process, which receives one input row, splits it, and emits the resulting rows through forward. GenericUDTF also declares a close method that must be provided, although it is often empty.

Figure 7. Simple example of customizing UDTF
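The initialize and process contract can be modelled in plain Java with a callback standing in for Hive's forward. This sketch does not extend the real GenericUDTF: SplitToRowsUdtf, its collector parameter, and the column names position and token are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

class SplitToRowsUdtf {
    // Receives each output row; plays the role of forward in a Hive UDTF.
    private final Consumer<Object[]> collector;

    SplitToRowsUdtf(Consumer<Object[]> collector) {
        this.collector = collector;
    }

    // initialize: validate the arguments and declare the output columns.
    List<String> initialize(int argumentCount) {
        if (argumentCount != 1) {
            throw new IllegalArgumentException("Expected exactly one argument");
        }
        return Arrays.asList("position", "token");
    }

    // process: take one input row, split it, and emit one output row per token.
    void process(String line) {
        if (line == null || line.isEmpty()) {
            return; // no output rows for empty input
        }
        String[] tokens = line.split(",");
        for (int i = 0; i < tokens.length; i++) {
            collector.accept(new Object[] {i, tokens[i].trim()});
        }
    }

    // close: nothing to flush in this sketch.
    void close() {
    }
}
```

Passing `rows::add` for some `List<Object[]> rows` as the collector gathers the emitted rows, so a single call to process on a comma-separated line yields one two-column row per token.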
