2025-01-16 Update From: SLTechnology News&Howtos
This article introduces the flexible Payload mechanism of Apache Hudi: what it is, why it is needed, and how the commonly used Payloads differ.
1. Abstract
Payload in Apache Hudi is an extensible data processing mechanism. With different Payloads, we can customize how data is written in complex scenarios, which greatly increases the flexibility of data processing. A Hudi Payload is a utility class that deduplicates, filters, and merges data when Hudi tables are written and read. The Payload class to use is specified with the parameter "hoodie.datasource.write.payload.class". In this article, we explore how Hudi Payload works, how the preset Payloads differ, and which scenarios each one suits.
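As a rough illustration, the payload class is usually set together with the other write options. The sketch below only builds the options as a plain Python dictionary. The table name `demo_table` is a placeholder, the field names `key` and `col3` are borrowed from the test section of this article, and the commented-out Spark call assumes a DataFrame `df` and a path `base_path` that are not defined here.

```python
# Hudi write options as a plain dictionary (placeholder table name).
hudi_options = {
    "hoodie.table.name": "demo_table",                    # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "key",     # primary key column
    "hoodie.datasource.write.precombine.field": "col3",   # precombine key column
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.OverwriteWithLatestAvroPayload",
}

# With a Spark session and the Hudi bundle available, the options would be
# passed to the writer roughly like this (df and base_path are assumed):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```

Switching to another Payload only means changing the value of the last option.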
2. Why do you need Payload
When writing data, full-row insertion and full-row overwrite cannot meet the requirements of every scenario, and the written data may also need customized processing. A more flexible way of writing, with some processing of the written data, is therefore needed. The Payload mechanism provided by Hudi addresses this well: for example, it can deduplicate data on write or update only some of the fields.
3. How Payload works
When writing to a Hudi table, you need to specify the parameter hoodie.datasource.write.precombine.field. This field is also known as the precombine key. Hudi Payload processes data based on this field: each record is wrapped in a Payload, so comparing records becomes comparing Payloads. Custom data processing only requires implementing the comparison methods of the Payload according to the business requirements.
All Payloads in Hudi implement the HoodieRecordPayload interface, including the preset Payload classes that ship with Hudi.
The HoodieRecordPayload interface declares several methods. The two most important ones, preCombine and combineAndGetUpdateValue, are analyzed below.
3.1 preCombine analysis
This method compares the current data with oldValue and returns one of the two records.
The Javadoc of preCombine also shows that it is used first for deduplication, when multiple records with the same primary key are written to Hudi in the same batch.
Call location
This method is also called in a second place: when a MOR table is read, records with the same primary key in the Log files are merged with it.
If the same record is modified multiple times and each change is written to the Log files of the MOR table, preCombine is applied when the table is read.
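The deduplication step can be modeled in a few lines. This is a plain-Python sketch, not the Hudi API: the `Payload` class, the `deduplicate` helper, and the sample values are illustrative assumptions, and keeping the newer record on a tie is a choice made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Payload:
    """Toy stand-in for a Hudi payload: a record plus its ordering value."""
    record: dict
    ordering_val: int

    def pre_combine(self, old: "Payload") -> "Payload":
        # Keep the old payload only when its ordering value is strictly larger.
        return old if old.ordering_val > self.ordering_val else self


def deduplicate(records, key_field, precombine_field):
    """Reduce records that share a key to a single record per key."""
    merged = {}
    for rec in records:
        new = Payload(rec, rec[precombine_field])
        old = merged.get(rec[key_field])
        merged[rec[key_field]] = new if old is None else new.pre_combine(old)
    return [p.record for p in merged.values()]


# Hypothetical batch: two records with the same key, compared on col3.
batch = [
    {"key": 1, "col0": "aa", "col3": 10},
    {"key": 1, "col0": "bb", "col3": 11},
]
print(deduplicate(batch, "key", "col3"))
```

Only the record with the larger col3 survives, which is the behavior described for records with the same primary key in one batch.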
3.2 combineAndGetUpdateValue analysis
This method compares currentValue (that is, the data in the existing Parquet files) with the new data to determine whether the new data needs to be persisted.
Because COW and MOR tables differ in how they read and write, combineAndGetUpdateValue is also called at different points for COW and MOR:
On a COW write, the newly written data is compared with the currentValue stored in the Hudi table, and the data that needs to be persisted is returned.
On a MOR read, the Log data already processed by preCombine is compared with the data in the Parquet file, and the resulting data is returned.
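The two call sites can be sketched as follows. This is a simplified model under stated assumptions, not Hudi's actual code path: the function names are made up for illustration, the merge logic is passed in as a `combine` function, and the sample records are hypothetical.

```python
def reduce_log_records(log_records, precombine_field):
    """Stand-in for preCombine: keep the log record with the largest ordering value."""
    latest = log_records[0]
    for rec in log_records[1:]:
        if rec[precombine_field] >= latest[precombine_field]:
            latest = rec
    return latest


def overwrite_with_latest(current, new):
    """One possible combine function: the new record always wins."""
    return new


def cow_write(stored, incoming, combine):
    # COW: the merge happens at write time, against the record stored in the table.
    return combine(stored, incoming)


def mor_read(base_record, log_records, precombine_field, combine):
    # MOR: log records are first reduced, then merged with the base (Parquet)
    # record when the table is read.
    latest_log = reduce_log_records(log_records, precombine_field)
    return combine(base_record, latest_log)


base = {"key": 1, "col0": "aa", "col3": 10}
logs = [
    {"key": 1, "col0": "bb", "col3": 11},
    {"key": 1, "col0": "cc", "col3": 12},
]
print(cow_write(base, logs[0], overwrite_with_latest))
print(mor_read(base, logs, "col3", overwrite_with_latest))
```

Passing the combine function as a parameter mirrors the idea that the call sites stay the same while the Payload class decides the merge result.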
4. Comparison of commonly used Payload processing logic
Now that we understand the core principle of Payload, let's compare and analyze the commonly used Payload implementations.
4.1 OverwriteWithLatestAvroPayload
In OverwriteWithLatestAvroPayload, preCombine chooses between records based on orderingVal (the value of the precombine key), while combineAndGetUpdateValue always returns the new data.
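One consequence is worth spelling out: an update that arrives late with a smaller precombine value still replaces the stored record. The sketch below models the two methods on plain dictionaries; the function names and sample values are assumptions for illustration, not Hudi code.

```python
def pre_combine(new, old, precombine_field):
    # Within one batch, the record with the larger ordering value is kept.
    return old if old[precombine_field] > new[precombine_field] else new


def combine_and_get_update_value(current, new):
    # Against stored data, the ordering value is not consulted: new data wins.
    return new


stored = {"key": 1, "col0": "bb", "col3": 11}
late_update = {"key": 1, "col0": "cc", "col3": 9}   # smaller col3 than stored

# In the same batch, preCombine would keep "bb" ...
print(pre_combine(late_update, stored, "col3"))
# ... but as an update to stored data, "cc" overwrites "bb".
print(combine_and_get_update_value(stored, late_update))
```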
4.2 OverwriteNonDefaultsWithLatestAvroPayload
OverwriteNonDefaultsWithLatestAvroPayload inherits from OverwriteWithLatestAvroPayload. It keeps the same preCombine method and overrides the combineAndGetUpdateValue method. Each field of the new data is compared with that field's default value in the schema. If the value in the new data differs from the default, the field is updated in the existing record; otherwise the existing value is kept. Since the default value defined in the schema is usually null, this Payload lets you update only the non-null fields. For example, if a record has five fields, using this Payload to update three of them does not affect the original values of the other two.
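The field-by-field rule can be sketched on dictionaries. This is a simplified model, assuming every field's schema default is null (`None` here); the function name, the `defaults` argument, and the sample records are illustrative, not part of Hudi.

```python
def combine_non_defaults(current, new, defaults):
    """Copy a field from new into current only if it differs from its default."""
    merged = dict(current)
    for field, value in new.items():
        if value != defaults.get(field):
            merged[field] = value
    return merged


# Assumed schema defaults: null for every column.
defaults = {"key": None, "col0": None, "col1": None, "col2": None, "col3": None}

stored = {"key": 1, "col0": "bb", "col1": "b1", "col2": "b2", "col3": 11}
update = {"key": 1, "col0": "cc", "col1": None, "col2": None, "col3": 9}

print(combine_non_defaults(stored, update, defaults))
```

col1 and col2 keep their stored values because the update carries only the default for them, while col0 and col3 are overwritten.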
4.3 DefaultHoodieRecordPayload
DefaultHoodieRecordPayload also inherits from OverwriteWithLatestAvroPayload and overrides the combineAndGetUpdateValue method. This Payload uses the precombine key to compare the existing data with the new data and decides from that comparison whether to update.
Next, we use a COW table to test how the different Payloads behave on write and read.
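A minimal model of that comparison, again on plain dictionaries: the function name and sample values are assumptions for illustration, and the stored record is kept only when its ordering value is strictly larger.

```python
def combine_default(current, new, precombine_field):
    """Keep the stored record when its ordering value is larger than the new one."""
    if current[precombine_field] > new[precombine_field]:
        return current
    return new


stored = {"key": 1, "col0": "bb", "col3": 11}
late_update = {"key": 1, "col0": "cc", "col3": 9}     # arrives out of order
fresh_update = {"key": 1, "col0": "dd", "col3": 12}

print(combine_default(stored, late_update, "col3"))   # stored record is kept
print(combine_default(stored, fresh_update, "col3"))  # new record replaces it
```

This is why the Payload suits out-of-order arrival: a stale message cannot overwrite newer stored data.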
5. Test
We use the following source data to write the Hudi table, with key as the primary key and col3 as the precombine key.
First, we write two records whose col0 values are 'aa' and 'bb' in a single batch. Because their primary keys are the same, they are deduplicated in preCombine by comparing col3, and only one record is written to the Hudi table. (Note that if the write operation is insert or bulk_insert, the records are not deduplicated.)
Query result
Next, we update the table with the record whose col0 is 'cc'. Because the three Payloads use different processing logic, the written results also differ.
OverwriteWithLatestAvroPayload: the new data completely overwrites the old data.
OverwriteNonDefaultsWithLatestAvroPayload: col1 and col2 are not updated, because they are null in the update data.
DefaultHoodieRecordPayload: the record is not updated, because the col3 of 'cc' is smaller than that of 'bb'.
6. Summary
The analysis above covers the Payloads commonly used in Hudi. They compare as follows:
Payload: update logic; applicable scenario
OverwriteWithLatestAvroPayload: always updates all fields of the old data with the new data; suitable when every update carries the complete record
OverwriteNonDefaultsWithLatestAvroPayload: updates the old data with the non-empty fields of the new data; suitable when each update carries only some of the fields
DefaultHoodieRecordPayload: decides whether to update by comparing the precombine key; suitable for real-time ingestion into the lake where data may arrive out of order
Although Hudi provides several preset Payloads, they still cannot cover some special scenarios. For example, a user ingests data into the lake in real time with Kafka and Hudi, but one modification to a record does not arrive in a single Kafka message. Instead it arrives as multiple messages with the same primary key: the first contains col0 and col1, the second contains col2 and col3, and the third contains col4. The Payloads that come with Hudi cannot merge these three messages into one complete record in the Hudi table. To achieve this, we must implement the business logic in a custom Payload, overriding the preCombine and combineAndGetUpdateValue methods, and specify the custom Payload implementation through hoodie.datasource.write.payload.class when writing.
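The merge rule such a custom Payload would need can be modeled in plain Python. This is only a sketch of the logic, not a Hudi Payload: a real implementation would be a Java or Scala class implementing HoodieRecordPayload. The function names and message values are hypothetical, and the sketch assumes messages are processed in arrival order and that a missing field is represented as `None`.

```python
from functools import reduce


def merge_non_null(older, newer):
    """Start from the older record and overlay the non-null fields of the newer one."""
    merged = dict(older)
    merged.update({f: v for f, v in newer.items() if v is not None})
    return merged


def pre_combine(new, old):
    # Messages in the same batch: merge fields instead of picking one record.
    return merge_non_null(old, new)


def combine_and_get_update_value(current, new):
    # Message vs. stored record: same field-level merge.
    return merge_non_null(current, new)


# Three hypothetical messages for the same primary key, each carrying some columns.
messages = [
    {"key": 1, "col0": "a", "col1": "b", "col2": None, "col3": None, "col4": None},
    {"key": 1, "col0": None, "col1": None, "col2": "c", "col3": "d", "col4": None},
    {"key": 1, "col0": None, "col1": None, "col2": None, "col3": None, "col4": "e"},
]

# All three in one batch: preCombine folds them into one record.
combined = reduce(lambda old, new: pre_combine(new, old), messages)
print(combined)

# Spread over separate writes: each message is merged into the stored record.
stored = messages[0]
for msg in messages[1:]:
    stored = combine_and_get_update_value(stored, msg)
print(stored)
```

Both paths end with the same complete record, which is the property the custom Payload has to guarantee whether the messages land in one batch or several.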