
Apache Hudi Tutorial: An Example Analysis of the Hudi and Hive Integration Manual


In this article, the editor shares an example analysis of the Apache Hudi and Hive integration manual. Since most people do not know much about this topic, the article is shared for your reference; I hope you gain a lot from reading it. Let's get into it!

1. Introduction to the Hive external table corresponding to a Hudi table

A Hudi source table corresponds to a copy of data on HDFS. The data of a Hudi table can be mapped to a Hive external table through the Spark or Flink components, or through the Hudi client. Based on this external table, Hive can conveniently run real-time view, read-optimized view, and incremental view queries.

2. Integrating Hudi with Hive

Here we take Hive 3.1.1 and Hudi 0.9.0 as examples; other versions are similar.

Put hudi-hadoop-mr-bundle-0.9.0xxx.jar and hudi-hive-sync-bundle-0.9.0xx.jar in the lib directory of the hiveserver node

Modify hive-site.xml: find the two configuration items hive.default.aux.jars.path and hive.aux.jars.path, and append the full paths of the jar packages from the first step. After configuration they look as follows:

<property>
  <name>hive.default.aux.jars.path</name>
  <value>xxxx.jar,xxxx.jar,file:///mypath/hudi-hadoop-mr-bundle-0.9.0xxx.jar,file:///mypath/hudi-hive-sync-bundle-0.9.0xx.jar</value>
</property>

Restart hive-server after configuration

For Hudi's bootstrap tables (queried through Tez), in addition to the two jar packages hudi-hadoop-mr-bundle-0.9.0xxx.jar and hudi-hive-sync-bundle-0.9.0xx.jar, you also need to add hbase-shaded-miscellaneous-xxx.jar, hbase-metric-api-xxx.jar, hbase-metrics-xxx.jar, hbase-protocol-shaded-xx.jar, hbase-shaded-protobuf-xxx.jar, and htrace-core4-4.2.0xxxx.jar in the same way as described above.

3. Create a Hive external table corresponding to the Hudi table

Generally speaking, Hudi tables are automatically synchronized to Hive external tables when data is written with Spark or Flink, in which case you can query the synchronized external table directly through beeline. If the write engine does not enable automatic synchronization, you need to synchronize manually with the Hudi client tool run_hive_sync_tool.sh; see the official website for the relevant parameters.
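Whichever way the sync happens, what it produces is an ordinary Hive external table bound to Hudi's input format. The following is a minimal sketch of such a definition for a COW table; the table name, columns, partition column, and location are hypothetical, and a real synced table also carries Hudi meta columns such as _hoodie_commit_time.

CREATE EXTERNAL TABLE hudicow (
  id bigint,
  name string,
  price double
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION '/mypath/hudicow';

For the rt table of an MOR source table, the input format would be org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat instead.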

4. Query the Hive external table corresponding to the Hudi table

4.1 Operation prerequisites

Before using Hive to query a Hudi table, you need to set hive.input.format through the set command; otherwise you will run into data duplication, query exceptions, and other errors. For example, the following error is typically caused by not setting hive.input.format:

java.lang.IllegalArgumentException: HoodieRealtimeReader can only work on RealTimeSplit and not with xxxxxxxxxx
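A minimal setting that avoids this error (full query examples follow in section 4.2.1):

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;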

In addition, incremental queries require three more parameters, also set through the set command:

set hoodie.mytableName.consume.mode=INCREMENTAL;
set hoodie.mytableName.consume.max.commits=3;
set hoodie.mytableName.consume.start.timestamp=commitTime;

Note that these three parameters are table-level parameters

The parameters are described as follows.

hoodie.mytableName.consume.mode: the query mode of the Hudi table. For an incremental query, set it to INCREMENTAL; for a non-incremental query, leave it unset or set it to SNAPSHOT.

hoodie.mytableName.consume.start.timestamp: the start commit time of the incremental query on the Hudi table (a Hudi commit time, in yyyyMMddHHmmss form).

hoodie.mytableName.consume.max.commits: the number of commits after hoodie.mytableName.consume.start.timestamp to cover in the incremental query. If set to 3, the query returns the data of the 3 commits after the specified start time; if set to -1, it returns all data committed after the specified start time.

4.2 COW type Hudi table query

For example, suppose the Hudi source table is named hudicow; after being synchronized to Hive, the Hive table is also named hudicow.

4.2.1 COW table real-time view query

After setting hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat or org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat, you can query it like an ordinary Hive table:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from hudicow;

4.2.2 COW table incremental query

In addition to setting hive.input.format, you also need to set the three incremental-query parameters above, and the incremental query statement must add a where clause that filters with `_hoodie_commit_time` > 'startCommitTime'. (The reason is that Hudi's small-file merging combines the data of new and old commits into new files, so Hive cannot tell directly from the parquet files which data is new and which is old.)

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hoodie.hudicow.consume.mode=INCREMENTAL;
set hoodie.hudicow.consume.max.commits=3;
set hoodie.hudicow.consume.start.timestamp=xxxx;
select count(*) from hudicow where `_hoodie_commit_time` > 'xxxx';

Note that the quotation marks around _hoodie_commit_time are backquotes (the key above Tab), not single quotes, while 'xxxx' uses single quotes.

4.3 Query for Hudi tables of MOR type

For example, the MOR-type Hudi source table is named hudimor and maps to two Hive external tables: hudimor_ro (the ro table) and hudimor_rt (the rt table).

4.3.1 MOR table read-optimized view query

This is in fact a query on the ro table. Like the COW table, it can be queried like an ordinary Hive table after hive.input.format is set.
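For example (a minimal sketch using the hudimor_ro table from above):

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select count(*) from hudimor_ro;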

4.3.2 MOR table real-time view

After setting hive.input.format, you can query the latest data of the Hudi source table:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
select * from hudimor_rt;

4.3.3 MOR table incremental query

This incremental query targets the rt table, not the ro table. It works much like the incremental query on a COW table:

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; -- note that this is specified as HoodieCombineHiveInputFormat
set hoodie.hudimor.consume.mode=INCREMENTAL;
set hoodie.hudimor.consume.max.commits=-1;
set hoodie.hudimor.consume.start.timestamp=xxxx;
select * from hudimor_rt where `_hoodie_commit_time` > 'xxxx'; -- the table queried here is the rt table

These settings are explained as follows.

set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat;

This is best used only for incremental queries on rt tables, although other kinds of queries can also run with it. Because the parameter affects ordinary Hive queries as well, once the rt incremental query is finished you should set it back to

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

or to the default value

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

for queries on other tables.

set hoodie.mytableName.consume.mode=INCREMENTAL;

This enables incremental query mode for the named table only. To switch the table back to another query mode, set

set hoodie.hudisourcetablename.consume.mode=SNAPSHOT;
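Putting the pieces together, after finishing an rt incremental query on the hudimor table above, the reset would look like this (a sketch combining the two settings):

set hoodie.hudimor.consume.mode=SNAPSHOT;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;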

Some problems remain in the current Hudi (0.9.0) integration with Hive; please use the master branch or the upcoming version 0.10.0:

Hive may read out all the data of a Hudi table, which causes serious performance and data security problems.

When reading the real-time view of an MOR table, set mapreduce.input.fileinputformat.split.maxsize as needed to prevent Hive from splitting the files it reads; otherwise data duplication will occur. This problem is currently unsolved: when Spark reads the Hudi real-time view, the code ensures files are not sliced, but in Hive it has to be set manually.
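As a sketch, this could be applied before querying the rt table; the value here is hypothetical, and should be chosen larger than your biggest file so that no file gets split:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- hypothetical: 1 GB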

If you encounter classNotFound, noSuchMethod, or similar errors, check the jar packages under Hive's lib directory for conflicts.

5. Hive side source code modification

To support Hive queries over Hudi's pure log files (file groups that have only log files and no base file yet), the Hive-side source code needs to be modified.

Modify org.apache.hadoop.hive.common.FileUtils as follows:

public static final PathFilter HIDDEN_FILES_PATH_FILTER = new PathFilter() {
  @Override
  public boolean accept(Path p) {
    String name = p.getName();
    // Hudi metadata files start with .hoodie
    boolean isHudiMeta = name.startsWith(".hoodie");
    // Hudi log files match the log-file naming pattern
    boolean isHudiLog = false;
    Pattern LOG_FILE_PATTERN =
        Pattern.compile("\\.(.*)_(.*)\\.(.*)\\.([0-9]*)(_(([0-9]*)-([0-9]*)-([0-9]*)))?");
    Matcher matcher = LOG_FILE_PATTERN.matcher(name);
    if (matcher.find()) {
      isHudiLog = true;
    }
    boolean isHudiFile = isHudiLog || isHudiMeta;
    // Keep Hudi meta/log files visible instead of filtering them out as hidden
    return (!name.startsWith("_") && !name.startsWith(".")) || isHudiFile;
  }
};

Recompile Hive, and replace hive-common-xxx.jar and hive-exec-xxx.jar in the hive server's lib directory with the newly compiled jars. Note that the permissions and file names should remain the same as those of the original jar packages.

Finally, restart hive-server.
