Example Analysis of Apache Hudi Multi-version cleanup Service


This article walks through the Apache Hudi multi-version cleanup (Cleaner) service with examples. It is intended as a practical reference.

1. Reclaim space to control storage cost

Hudi provides several table management services for maintaining tables on the data lake; one of them is the Cleaner (cleanup service). As a user writes more data to a table, Hudi either generates a new version of the data file holding the updated records for each update (COPY_ON_WRITE) or appends the incremental updates to log files so that existing data file versions are not rewritten (MERGE_ON_READ). Depending on the update frequency, the number of file versions can therefore grow without bound. If you do not need to keep unlimited history, you need a process (service) that reclaims old versions of the data: this is Hudi's cleanup service.

2. Problem description

In a data lake architecture, it is very common for readers and writers to access the same table concurrently. Because the Hudi cleanup service periodically reclaims older file versions, a long-running query may still be reading a file version that the cleanup service has already reclaimed, so the correct configuration is needed to ensure such queries do not fail.

3. Learn more about Hudi cleanup services

With the above scenario in mind, let's first look at the different cleanup policies Hudi provides and the corresponding properties that need to be configured. Cleanup can be run either synchronously or asynchronously. Before going into more detail, let's define some basic concepts:

Hudi base file (HoodieBaseFile): a columnar file containing the final, compacted data. The base file name follows the naming convention <fileId>_<writeToken>_<instantTime>.parquet. In subsequent writes to this file the file ID stays the same, while the commit time is updated to reflect the latest version. This also means that, given its partition path, any particular version of a record can be uniquely located using the file ID and instantTime.

File slice (FileSlice): a file slice consists of a base file and, for the MERGE_ON_READ table type, the incremental log files attached to it.

Hudi file group (FileGroup): any file group in Hudi is uniquely identified by a partition path and a file ID, which the files in the group carry as part of their names. A file group consists of all the file slices for that file ID in a particular partition path, and any partition path can contain multiple file groups.
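To illustrate the naming convention above, here is a minimal Python sketch (the file name below is made up, not taken from the article) that splits a base file name into file ID, write token, and instant time:

# Minimal sketch: split a hypothetical Hudi base file name into its parts.
# Assumed convention from the text: <fileId>_<writeToken>_<instantTime>.parquet
def parse_base_file_name(name):
    stem = name[:-len(".parquet")]          # drop the extension
    file_id, write_token, instant_time = stem.split("_", 2)
    return {"fileId": file_id, "writeToken": write_token, "instantTime": instant_time}

# Example with a made-up name: the file ID stays fixed across versions,
# while the instant time changes with every new commit.
print(parse_base_file_name(
    "a1b2c3d4-0000-0000-0000-000000000000-0_1-2-3_20240601103000.parquet"))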

4. Clean-up service

The Hudi cleanup service currently supports the following cleanup policies:

KEEP_LATEST_COMMITS: this is the default policy. It ensures that all changes made in the last X commits can still be traced back. Suppose data is ingested into the Hudi dataset every 30 minutes and the longest-running query may take 5 hours to complete; then the user should retain at least the last 10 commits. With such a configuration, the oldest file versions are kept on disk for at least 5 hours, so the longest-running query cannot fail at any point in time. Incremental cleaning is also possible with this policy. A configuration sketch for this example follows this list.

KEEP_LATEST_FILE_VERSIONS: this policy keeps the latest N file versions, irrespective of time. It is useful when you know the maximum number of file versions you want to keep at any given time. To get the same protection against long-running query failures as above, the number of versions should be computed from the data patterns; the policy is also useful when you simply want to keep only the latest version of each file.
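For the 30-minute ingestion / 5-hour query example under KEEP_LATEST_COMMITS, a minimal configuration sketch might look like the following (10 simply comes from 5 hours divided by 30 minutes):

hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=10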

5. Examples

Suppose a user ingests data into a COPY_ON_WRITE Hudi dataset every 30 minutes, as follows:

Figure 1: incoming records ingested into the Hudi dataset every 30 minutes

The figure shows a particular partition on DFS, with commits and their corresponding file versions color-coded. Four different file groups are created in this partition, identified as fileId1, fileId2, fileId3, and fileId4. The file group for fileId2 has records from all 5 commits, while the group for fileId4 has records from only the last 2 commits.

Suppose the following configuration is used for cleanup:

hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=2

The Cleaner selects the file versions to clean up as follows:

Do not clean up the latest version of any file.

Determine the commit time of the last 2 (configured) + 1 commits. In figure 1, commit 10:30 and commit 10:00 are the two latest commits in the timeline. One additional commit is included because the time window for retaining commits is essentially equal to the longest expected query run time. So if the longest query takes 1 hour to complete and ingestion happens every 30 minutes, you need to retain the last 2 commits, since 2 * 30 minutes = 60 minutes (1 hour). The longest query can then still read the files written in the third-latest commit. This means that a query that starts executing after commit 9:30 will still be running when a cleanup operation is triggered after commit 10:30, as shown in figure 2.

Now, for any file group, only those file slices that do not have a savepoint (another Hudi table service) and whose commit time is earlier than the third-latest commit (commit 9:30 in figure 2) are cleaned up.

Figure 2: retaining the files from the last 3 commits
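To make the selection steps above concrete, here is a simplified Python sketch of the KEEP_LATEST_COMMITS retention logic (the data structures are hypothetical, not Hudi's internals, and the rule that the latest slice of every file group is always kept is omitted for brevity):

# Simplified sketch of KEEP_LATEST_COMMITS selection logic (not Hudi's real code).
# Each file slice is (commit_time, has_savepoint); times are sortable strings.
def slices_to_clean(file_group_slices, all_commit_times, commits_retained):
    # Retain the last (commits_retained + 1) commits; anything written before
    # the earliest retained commit is a cleanup candidate.
    retained = sorted(all_commit_times)[-(commits_retained + 1):]
    earliest_retained = retained[0]
    to_clean = []
    for commit_time, has_savepoint in file_group_slices:
        # Savepointed slices are never cleaned; neither are slices inside
        # the retained commit window.
        if not has_savepoint and commit_time < earliest_retained:
            to_clean.append(commit_time)
    return to_clean

# Example matching figures 1 and 2: five commits, commits.retained = 2,
# so everything older than commit 09:30 is cleaned.
commits = ["08:30", "09:00", "09:30", "10:00", "10:30"]
slices = [(c, False) for c in commits]
print(slices_to_clean(slices, commits, commits_retained=2))   # ['08:30', '09:00']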

Suppose the following configuration is used for cleanup:

hoodie.cleaner.policy=KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained=1

The cleanup service does the following:

For every file group, the latest file slice (including any slice pending compaction) is retained and the rest are cleaned up. As shown in figure 3, if the cleanup operation is triggered immediately after commit 10:30, the cleanup service keeps only the latest version in each file group and deletes the rest.

Figure 3: keep the latest file version in each filegroup

6. Configuration

Details and default values for all possible configurations can be found in the Hudi configuration documentation.
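As an illustration of how such properties are typically supplied (an assumption-laden sketch, not something shown in the article), the cleaner settings from the example above can be passed as write options when writing a DataFrame to Hudi from PySpark; the table name, field names, and path below are placeholders:

# Sketch: passing cleaner configs as Hudi write options in PySpark.
# Requires the Hudi Spark bundle on the classpath; names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-cleaner-config-example").getOrCreate()
df = spark.createDataFrame([("id1", "p1", 1)], ["uuid", "partition", "ts"])

hudi_options = {
    "hoodie.table.name": "my_table",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "partition",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # Cleaner settings from the KEEP_LATEST_COMMITS example above:
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": "2",
}

df.write.format("hudi").options(**hudi_options).mode("append").save("/tmp/hudi/my_table")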

7. Run command

Hudi's cleaner table service can be run as a separate process or together with data ingestion. As mentioned earlier, it reclaims any stale file versions. If you run it along with ingestion, it can be configured to work either synchronously or asynchronously. Alternatively, you can run the cleanup service on its own with the following command:

[hoodie]$ spark-submit --class org.apache.hudi.utilities.HoodieCleaner \
  --props s3:///temp/hudi-ingestion-config/config.properties \
  --target-base-path s3:///temp/hudi \
  --spark-master yarn-cluster
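The article does not show the contents of the --props file; as a hedged sketch, it could carry the cleaner properties discussed earlier, for example:

# Hypothetical config.properties for the standalone HoodieCleaner run (illustrative only)
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=2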

If you want to run the cleanup service asynchronously with writes, you can configure the following:

hoodie.clean.automatic=true
hoodie.clean.async=true

You can also manage Hudi datasets with the Hudi CLI, which provides the following commands for the cleanup service:

cleans show

clean showpartitions

clean run
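As a minimal usage sketch (assuming the CLI is launched via the hudi-cli script shipped with Hudi; the S3 path is the placeholder used earlier), a session would first connect to the table's base path and then inspect completed cleans:

connect --path s3:///temp/hudi
cleans show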

More details and related code for these commands can be found in the org.apache.hudi.cli.commands.CleansCommand class.

Thank you for reading! This concludes the article on "Example Analysis of Apache Hudi Multi-version Cleanup Service". I hope the content above is helpful; if you found the article useful, feel free to share it so more people can see it.
