This article explains how to use Apache Hudi to accelerate a traditional daily batch pipeline.
Apache Hudi (Hudi for short) lets you store large amounts of data on top of Hadoop-compatible storage and, in addition to classic batch processing, provides two primitives that enable stream-style processing on the data lake: record-level updates/deletes and incremental change streams.
1. Current state
1.1 Data lake ingestion and compute pipeline: handling updates
In our use case, 1-10% of daily records are updates to historical data. When a record is updated, we need to delete the old entry from its previous updated_date partition and add the new entry to the latest partition. Without delete and update support, we must re-read the entire history table partition -> deduplicate the data -> overwrite the entire table partition with the newly deduplicated data.
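For illustration only, here is a minimal PySpark sketch of that traditional full-rewrite flow, under assumed S3 paths and an order_id / updated_date schema (not our actual job):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("daily-full-rewrite").getOrCreate()

# Full history table and today's incremental load (inserts plus 1-10% updates).
history = spark.read.parquet("s3://lake/orders_history/")
daily = spark.read.parquet("s3://lake/orders_daily/")

# Union everything, then keep only the latest record per primary key.
merged = history.unionByName(daily)
latest = (
    merged
    .withColumn(
        "rn",
        F.row_number().over(
            Window.partitionBy("order_id").orderBy(F.col("updated_date").desc())
        ),
    )
    .filter("rn = 1")
    .drop("rn")
)

# Rewrite the whole table every single day (written to a staging path,
# then swapped with the live table).
(latest.write
    .mode("overwrite")
    .partitionBy("updated_date")
    .parquet("s3://lake/orders_history_new/"))
```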
1.2 Challenges in the current batch process
This process works, but it also has its own drawbacks:
Time and cost: the entire history table has to be overwritten every day.
Data versioning: there is no out-of-the-box data or manifest versioning, so rollback, concurrent reads and writes, point-in-time queries, time travel, and related features are unavailable.
Write amplification: overwriting the full history every day, in scenarios where data versioning is managed externally (or self-managed), increases write amplification and therefore consumes more S3 storage.
With Apache Hudi, we hoped to find a better way to handle deduplication and data versioning while ingesting data into the data lake.
2. Hudi data lake: query patterns
When we began our journey to implement Apache Hudi on our data lake, we divided the tables into two categories according to the query patterns of the primary users of the tables.
For ETL: this covers most of the raw/base snapshot tables that we ingest from various production systems into the data lake. Since these tables are mainly consumed by ETL jobs, we partition the daily data by updated_date so that downstream jobs can simply read the latest updated_date partition and (re)process the data.
For analysts: this covers most of the computed OLAP tables, which usually include dimension tables and are queried by business analysts. Analysts generally need to look at data by the transaction (or event) created_date and care much less about updated_date.
Here is a sample e-commerce order data flow, from ingestion into the data lake, to building the OLAP tables, to business analysts querying them.
Because the date partition columns of the two types of tables are different, we use different strategies to solve these two use cases.
2.1 Analyst-facing tables / OLAP (partitioned by created_date)
In Hudi, we need to specify the partition column and the primary key column so that Hudi can handle updates and deletions for us.
Here is the logic of how we handle updates and deletions in the analyst-oriented table:
Read the upstream data for the D-n updated_date partitions.
Apply the data transformations. The resulting data now contains only new inserts and a small number of updated records.
Issue a Hudi upsert operation to upsert the processed data into the target Hudi table.
Because the primary key and created_date stay the same for existing and incoming records, Hudi can locate the partition and the partition file path of the existing record by using the created_date and primary_key columns of the incoming record.
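A minimal sketch of such an upsert in PySpark, assuming an active SparkSession with the Hudi Spark bundle on the classpath, a hypothetical orders_olap table keyed on order_id and partitioned by created_date, and processed_df as the output of the transformation step above:

```python
# processed_df is assumed to be the transformed D-n data from the steps above.
hudi_options = {
    "hoodie.table.name": "orders_olap",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "created_date",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.operation": "upsert",
}

(processed_df.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")             # "append" is the normal save mode for Hudi upserts
    .save("s3://lake/hudi/orders_olap/"))
```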
2.2 Tables for ETL (partitioned by updated_date)
When we started using Hudi, after reading many blogs and documents, it seemed logical to partition ETL-oriented tables on created_date.
In addition, Hudi offers incremental consumption, which would let us partition the table on created_date and still fetch only those records that were inserted or updated on D-1 or D-n.
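For illustration, an incremental read might look like the sketch below, assuming a hypothetical table path and a begin instant taken from the last commit we consumed on the table's timeline:

```python
# Pull only the records written after the last commit we consumed.
incremental_options = {
    "hoodie.datasource.query.type": "incremental",
    "hoodie.datasource.read.begin.instanttime": "20240101000000",  # last consumed commit instant
}

incremental_df = (
    spark.read
    .format("hudi")
    .options(**incremental_options)
    .load("s3://lake/hudi/orders_raw/")
)
```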
1. Challenges with "created_date" partitioning
This approach works well in theory, but adopting incremental consumption in a traditional daily batch process raises a series of other challenges:
Hudi maintains a timeline of all operations performed on the table at different instants; each commit records which files were inserted or rewritten as part of the upsert. This is the Hudi commit timeline.
The important point to note is that incremental queries are driven by this commit timeline and do not depend on the actual update/creation dates present in the data records.
Cold start: when we migrate an existing upstream table to Hudi, a D-1 incremental query returns the complete table, not just the D-1 updates. This happens because, at the start, the entire table was created by a single initial commit (or a few commits) that all fall within the D-1 commit window, so there is no real incremental commit history.
Historical data re-ingestion: in each regular D-1 incremental pull, we expect only the records updated on D-1 as output. However, when historical data is re-ingested, a problem similar to the cold-start problem described above appears, and downstream jobs can also hit OOM errors.
As a solution for ETL-oriented jobs, we tried keeping the data partitioned on updated_date itself, but that approach has its own challenges.
2. Challenges with "updated_date" partitioning
As we know, Hudi's default indexes are local to a partition: Hudi relies on the index to obtain the row-to-part-file mapping, which is stored within each data partition's directory. Therefore, if the table is partitioned on updated_date, Hudi cannot automatically deduplicate records across partitions.
Hudi's global indexing strategy requires maintaining an internal or external index to deduplicate data across partitions. At our volume of roughly 200 million records per day, this approach either runs slowly or fails with OOM errors.
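For reference, the difference comes down to the index configuration. The snippet below is purely illustrative and uses standard Hudi options rather than our production settings:

```python
# Hudi's default (partition-scoped) index cannot match a record against a copy
# sitting in a different updated_date partition. A global index can, at the
# cost of a much heavier lookup.
global_index_options = {
    "hoodie.index.type": "GLOBAL_BLOOM",
    # Route an update whose partition value changed back to the record's
    # original partition instead of duplicating it.
    "hoodie.bloom.index.update.partition.path": "true",
}
```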
Therefore, to solve the deduplication challenge for updated_date-partitioned tables, we devised a new deduplication strategy that also performs well.
3. The "new" deduplication strategy
Find updates: from the daily incremental load, filter out only the updates (1-10% of the daily data, where updated_date > created_date). This is fast and map-only.
Find stale records: broadcast-join these "updates" against the downstream Hudi base table. Because we join only the updated records (just 1-10% of the daily increment), the broadcast join is very fast. This gives us all the existing records in the base Hudi table that correspond to the updated records.
Delete stale records: issue a Hudi delete operation for these stale records against the base Hudi table path.
Insert: issue a Hudi insert operation for the full daily incremental load against the base Hudi table path.
Further optimization: set the _hoodie_is_deleted column to true on the stale records, union them with the daily incremental load, and issue a single upsert against the base Hudi table path. This performs the inserts and deletes in one operation (and one commit), as sketched after this list.
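Putting the steps together, a minimal PySpark sketch of this flow, under assumed column names (order_id, created_date, updated_date), hypothetical paths, and an active SparkSession with the Hudi bundle available, might look like this:

```python
from pyspark.sql import functions as F

base_path = "s3://lake/hudi/orders_etl/"
daily_df = spark.read.parquet("s3://lake/orders_daily/")   # full daily incremental load

# 1. Find updates: 1-10% of the daily data, cheap map-only filter.
updates_df = daily_df.filter(F.col("updated_date") > F.col("created_date"))

# 2. Find stale records: broadcast-join the small updates set against the base
#    Hudi table to locate the old copies sitting in older updated_date partitions.
base_df = spark.read.format("hudi").load(base_path).select(daily_df.columns)
stale_df = base_df.join(F.broadcast(updates_df.select("order_id")), "order_id", "inner")

# 3 + 4 (optimized): mark the stale rows as deletes, union them with the daily
#    load, and issue one upsert so deletes and inserts land in a single commit.
to_write = (
    stale_df.withColumn("_hoodie_is_deleted", F.lit(True))
    .unionByName(daily_df.withColumn("_hoodie_is_deleted", F.lit(False)))
)

(to_write.write
    .format("hudi")
    .option("hoodie.table.name", "orders_etl")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.partitionpath.field", "updated_date")
    .option("hoodie.datasource.write.precombine.field", "updated_date")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save(base_path))
```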
4. Advantages of Apache Hudi
Time and cost: Hudi does not overwrite the entire table during deduplication; it rewrites only the files that receive updates, so each upsert touches far less data.
Data versioning: Hudi retains table versions (the commit history), providing point-in-time queries (time travel) and table rollback capabilities.
Write amplification: since only the changed files are rewritten and retained for data version control, we do not need to keep versions of the complete dataset, so overall write amplification is minimal.
As another benefit, data versioning solves the concurrent read/write problem: concurrent readers can read versioned copies of the data files and no longer hit FileNotFoundException when a concurrent writer overwrites the same partition with new data.