2025-02-21 Update From: SLTechnology News&Howtos
Shulou(Shulou.com)06/01 Report--
This article introduces how to write and apply custom functions in Spark with Hive. It first looks at how Spark registers and loads functions, then covers the three kinds of custom function in turn: UDF, UDAF and UDTF.
1. Brief introduction
Spark currently supports three types of custom function: UDF, UDTF and UDAF. UDF: takes one row and returns one result (one-to-one), for example a function that takes an IP address and returns the corresponding province. UDTF: takes one row and returns multiple rows (one-to-many). This is a Hive concept; Spark SQL has no UDTF of its own, but flatMap in Spark achieves the same effect. UDAF: takes multiple rows and returns one row (many-to-one). It is used for aggregation, typically together with groupBy, in the same way as Spark's built-in aggregate functions such as count and sum, but it is comparatively complex to implement.
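The three shapes described above (one-to-one, one-to-many, many-to-one) can be sketched with plain Java streams. This is only an illustration of the cardinalities; the class and method names are made up for the example and no Spark or Hive classes are involved.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration of the three cardinalities; hypothetical names, no Spark or Hive API.
class FunctionShapes {
    // UDF shape: one input row -> one output value.
    static String upper(String row) {
        return row.toUpperCase();
    }

    // UDTF shape: one input row -> zero or more output rows.
    static List<String> explodeWords(String row) {
        return Arrays.asList(row.split(" "));
    }

    // UDAF shape: many input rows -> one aggregated value.
    static int totalLength(List<String> rows) {
        return rows.stream().mapToInt(String::length).sum();
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("a b", "c");
        // map keeps the row count, flatMap may change it, the aggregate collapses it to one value.
        System.out.println(rows.stream().map(FunctionShapes::upper).collect(Collectors.toList()));
        System.out.println(rows.stream().flatMap(r -> explodeWords(r).stream()).collect(Collectors.toList()));
        System.out.println(totalLength(rows));
    }
}
```

The flatMap call plays the role the text assigns to flatMap in Spark: each input row is expanded into a list of rows, and the lists are concatenated.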
Under the hood, Spark wraps each function in a CatalogFunction structure, in which FunctionIdentifier holds basic information such as the function name, and FunctionResource describes the resource type (jar or file) and its path. Spark's SessionCatalog provides a series of APIs for registering, dropping and looking up functions. When a Spark Executor receives a SQL request that calls a function, it uses the cached CatalogFunction information to find the jar path and the class name; the JVM then loads the jar dynamically, and the function is instantiated from the class name through reflection and executed.
Figure 1. CatalogFunction structure
Figure 2. Register load function logic
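The register-then-load-by-reflection flow can be sketched in plain Java as follows. This is a simplified model, not Spark's implementation: FunctionEntry, MiniCatalog and TrimFunction are hypothetical names, and the jar-loading step is left out because the example class is assumed to be on the classpath already.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a catalog entry: the function name plus the implementing class name.
// The jar resource that the real structure carries is omitted in this sketch.
class FunctionEntry {
    final String name;
    final String className;

    FunctionEntry(String name, String className) {
        this.name = name;
        this.className = className;
    }
}

// Example function implementation that is only ever referenced by its class name.
class TrimFunction implements Function<String, String> {
    @Override
    public String apply(String input) {
        return input.trim();
    }
}

// Hypothetical miniature catalog offering register, drop and load.
class MiniCatalog {
    private final Map<String, FunctionEntry> functions = new HashMap<>();

    void register(FunctionEntry entry) {
        functions.put(entry.name, entry);
    }

    boolean drop(String name) {
        return functions.remove(name) != null;
    }

    // Look up the entry, load the class by name, and instantiate it through reflection.
    @SuppressWarnings("unchecked")
    Function<String, String> load(String name) {
        FunctionEntry entry = functions.get(name);
        if (entry == null) {
            throw new IllegalArgumentException("Undefined function: " + name);
        }
        try {
            Class<?> cls = Class.forName(entry.className);
            return (Function<String, String>) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + entry.className, e);
        }
    }
}
```

A caller would register `new FunctionEntry("my_trim", TrimFunction.class.getName())` and then call `load("my_trim").apply(...)`. The catalog stores only names, so the implementing class is not touched until the function is actually used.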
HiveSessionCatalog extends Spark's SessionCatalog. It decorates Spark's base functionality to fit Hive, including function handling. Each Hive function base class is wrapped by a corresponding Spark class: HiveSimpleUDF wraps UDF, HiveGenericUDF wraps GenericUDF, HiveUDAFFunction wraps AbstractGenericUDAFResolver and UDAF, and HiveGenericUDTF wraps GenericUDTF.
Figure 3. Hive decorates spark function logic
2. UDF
UDFs are the most commonly used custom functions and the easiest to write. There are two kinds: for simple data types, extend the UDF class; for complex data types such as Map, List and Struct, extend the GenericUDF class.
A UDF for simple types can define several methods named evaluate, each with whatever parameters and return type are needed. By default the UDF class uses DefaultUDFMethodResolver as its method resolver: given the types of the arguments the user passes, the resolver uses reflection to find the matching method named evaluate. Users can also supply their own resolver.
Figure 4. Simple example of customizing UDF
Figure 5. Default UDF method parser
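The idea of resolving evaluate by reflection can be sketched in plain Java. This is a deliberately reduced model and not Hive's DefaultUDFMethodResolver: StrLenFunction and EvaluateResolver are hypothetical names, matching is by exact runtime argument class only (no type conversion), and null arguments are not handled.

```java
import java.lang.reflect.Method;

// Hypothetical UDF-like class with two overloaded evaluate methods.
class StrLenFunction {
    public int evaluate(String s) {
        return s == null ? 0 : s.length();
    }

    public int evaluate(String a, String b) {
        return evaluate(a) + evaluate(b);
    }
}

// Hypothetical resolver: picks the evaluate overload from the argument types.
class EvaluateResolver {
    // Find the public method named "evaluate" whose parameter types equal the argument classes.
    static Method resolve(Class<?> udfClass, Object... args) {
        Class<?>[] argTypes = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            argTypes[i] = args[i].getClass(); // assumption: arguments are non-null
        }
        try {
            return udfClass.getMethod("evaluate", argTypes);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No matching evaluate method", e);
        }
    }

    // Resolve the overload and invoke it on the given instance.
    static Object call(Object udf, Object... args) {
        try {
            return resolve(udf.getClass(), args).invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With this sketch, `EvaluateResolver.call(new StrLenFunction(), "abc")` dispatches to the one-argument overload and `call(new StrLenFunction(), "ab", "c")` to the two-argument one, which mirrors why a simple UDF may declare several evaluate methods.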
3. UDAF
A UDAF is an aggregate function. There are currently three main ways to implement one: extend the UDAF class, which is deprecated; extend UserDefinedAggregateFunction, which aggregates data in stages; or extend AbstractGenericUDAFResolver, which is slightly more complex than UserDefinedAggregateFunction and also requires an evaluator (a subclass of GenericUDAFEvaluator). With the resolver approach, the UDAF's processing logic lives mainly in the evaluator.
A UserDefinedAggregateFunction defines the input and output data structures and implements four steps: initializing the buffer (initialize), folding a single row into the buffer (update), merging two buffers (merge), and computing the final result (evaluate).
Figure 6. Simple example of customizing UDAF
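The four steps can be illustrated with an average computed in plain Java. This is a sketch of the lifecycle only: AverageAggregate and its Buffer are hypothetical, and the class does not extend Spark's UserDefinedAggregateFunction or declare any schema.

```java
// Hypothetical plain-Java model of the initialize / update / merge / evaluate lifecycle.
class AverageAggregate {
    // Aggregation buffer: running sum and row count.
    static final class Buffer {
        double sum;
        long count;
    }

    // initialize: start from an empty buffer.
    Buffer initialize() {
        return new Buffer();
    }

    // update: fold one input row into a buffer.
    void update(Buffer buffer, double value) {
        buffer.sum += value;
        buffer.count += 1;
    }

    // merge: combine a partial buffer (for example from another partition) into the target.
    void merge(Buffer target, Buffer other) {
        target.sum += other.sum;
        target.count += other.count;
    }

    // evaluate: produce the final result, or null when no rows were seen.
    Double evaluate(Buffer buffer) {
        return buffer.count == 0 ? null : buffer.sum / buffer.count;
    }
}
```

Keeping sum and count in the buffer, rather than a running average, is what makes merge correct: two partial averages cannot be combined without knowing how many rows each one covers.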
4. UDTF
A UDTF can be understood, roughly, as a function that generates multiple rows from each input row. It can produce multiple rows and multiple columns, and is also known as a table-generating function. The current way to implement one is to extend GenericUDTF and implement two methods: initialize, which checks the parameters and defines the output columns, and process, which takes one row of data and splits it.
Figure 7. Simple example of customizing UDTF
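The division of work between initialize and process can be sketched in plain Java. This is a model under stated assumptions, not Hive code: SplitPairsFunction is hypothetical, it does not extend GenericUDTF, the output rows go to a Consumer passed in by the caller, and the "k1:v1;k2:v2" input format is invented for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical plain-Java model of a table-generating function.
class SplitPairsFunction {
    private Consumer<String[]> collector;

    // initialize: check the argument count, remember the row collector, declare the output columns.
    List<String> initialize(int argumentCount, Consumer<String[]> collector) {
        if (argumentCount != 1) {
            throw new IllegalArgumentException("expects exactly one argument");
        }
        this.collector = collector;
        return Arrays.asList("key", "value");
    }

    // process: split one input such as "k1:v1;k2:v2" and emit one (key, value) row per well-formed pair.
    void process(String input) {
        if (input == null) {
            return;
        }
        for (String pair : input.split(";")) {
            String[] kv = pair.split(":");
            if (kv.length == 2) {
                collector.accept(kv); // malformed pairs are skipped
            }
        }
    }
}
```

A caller would pass something like `rows::add` as the collector and then call process once per input row; each call may emit zero, one or many output rows, which is the one-to-many behaviour described above.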