This article analyzes the file marker mechanism in the Apache Hudi kernel. It should be a useful reference for interested readers, and I hope you gain a lot from it.
1. Abstract
Hudi supports automatic cleanup of data from unsuccessful commits at write time. Apache Hudi introduces a marker mechanism during writes to efficiently track the data files written to storage. In this blog, we dive into the design of the existing direct marker file mechanism, explain its performance problems for very large writes on cloud storage (such as AWS S3 or Aliyun OSS), and demonstrate how introducing timeline-server-based markers improves write performance.
2. Why introduce the marker mechanism
A marker in Hudi is a label indicating that a corresponding data file exists in storage. Hudi uses markers to automatically clean up uncommitted data in failure and rollback scenarios.
Each marker entry consists of three parts:
Data file name
Marker extension (.marker)
I/O operation that created the file (CREATE for inserts, MERGE for updates/deletes, or APPEND for either)
For example, the marker 91245ce3-bb82-4f9f-969e-343364159174-0140579-0_20210820173605.parquet.marker.CREATE indicates that the corresponding data file is 91245ce3-bb82-4f9f-969e-343364159174-0140579-0_20210820173605.parquet and the I/O type is CREATE.
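To make the naming scheme concrete, here is a minimal Scala sketch (not Hudi's internal API; the object name is made up) that splits a marker file name into the data file it refers to and the I/O type that created it:

```scala
// Minimal sketch: split "<data file>.marker.<I/O type>" into its two parts.
object MarkerName {
  def parse(markerFileName: String): Option[(String, String)] = {
    val token = ".marker."
    val idx = markerFileName.lastIndexOf(token)
    if (idx < 0) None
    else Some((markerFileName.substring(0, idx),
               markerFileName.substring(idx + token.length)))
  }

  def main(args: Array[String]): Unit = {
    val marker =
      "91245ce3-bb82-4f9f-969e-343364159174-0140579-0_20210820173605.parquet.marker.CREATE"
    // Prints: Some((91245ce3-...-0_20210820173605.parquet, CREATE))
    println(parse(marker))
  }
}
```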
Before writing each data file, the Hudi write client first creates a marker in storage; the marker is persisted and is explicitly deleted by the write client after the commit succeeds.
Markers help the write client perform several operations efficiently, and they serve two main functions:
Removing duplicate / partial data files: when writing to Hudi via Spark, multiple executors write data files concurrently. One executor may fail, leaving behind partially written data files, in which case Spark retries the task. When speculative execution is enabled, multiple task attempts may successfully write the same data to different files, but only one attempt is handed to the Spark driver for commit. Markers help identify the partially written data files, which contain data duplicated by later, successfully written data files; these duplicate files are cleaned up before the write and commit complete.
Rolling back failed commits: a write may fail midway, leaving partially written data files. In this case, the marker entries remain in storage when the commit fails. On the next write operation, the write client first rolls back the failed commits, using the markers to identify the data files written in those commits and deleting them. The cleanup flow is sketched below.
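As an illustration only, the following Scala sketch (using Hadoop's FileSystem API, with a placeholder marker directory layout of .hoodie/.temp/<instant>; this is not Hudi's actual rollback implementation) shows how markers could be used to locate and delete partially written data files:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative sketch: roll back a failed commit by using its markers to find
// and delete partially written data files, then remove the markers themselves.
object RollbackSketch {
  def rollbackFailedCommit(basePath: String, instantTime: String): Unit = {
    val fs = FileSystem.get(new Configuration())
    val markerDir = new Path(s"$basePath/.hoodie/.temp/$instantTime")
    if (fs.exists(markerDir)) {
      val qualifiedMarkerDir = fs.makeQualified(markerDir).toString
      val markers = fs.listFiles(markerDir, true) // recursive listing
      while (markers.hasNext) {
        val markerPath = markers.next().getPath
        // Marker name encodes the data file name, e.g. "<file>.parquet.marker.CREATE".
        val dataFileName = markerPath.getName.replaceAll("\\.marker\\..*$", "")
        // The marker's partition path mirrors the data file's partition path.
        val partition = markerPath.getParent.toString
          .stripPrefix(qualifiedMarkerDir).stripPrefix("/")
        val dataFile = new Path(s"$basePath/$partition/$dataFileName")
        if (fs.exists(dataFile)) fs.delete(dataFile, false)
      }
      fs.delete(markerDir, true) // finally remove the markers
    }
  }
}
```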
Next, we examine the existing marker mechanism in depth, explain its performance problems, and demonstrate a new timeline-server-based marker mechanism that solves them.
3. The existing direct marker mechanism and its limitations
The existing direct marker mechanism simply creates a new marker file for each data file, with the marker file name formed as described above. The marker files are written under the temporary folder .hoodie/.temp beneath the base path of the Hudi table, mirroring the same directory hierarchy as the data files, that is, the commit instant and partition path. For example, the figure below shows the marker files and corresponding data files created while writing data to a Hudi table. When getting or deleting all marker file paths, the mechanism first lists all paths under the temporary folder .hoodie/.temp/ and then performs the operation.
Although this is much more efficient than scanning the entire table for uncommitted data files, as the number of data files to be written grows, so does the number of marker files to be created. This can become a performance bottleneck on cloud storage such as AWS S3. In AWS S3, each file create and delete call triggers an HTTP request, and there is a rate limit on the number of requests per second that each prefix in a bucket can handle. When many data files and marker files are written concurrently, the marker file operations become a significant bottleneck for write performance. On storage such as HDFS, where file system metadata is effectively cached in memory, users may hardly notice this.
4. Timeline-server-based marker mechanism to improve write performance
To address the performance bottleneck caused by the AWS S3 rate limits described above, we introduce a new marker mechanism that uses the timeline server, which optimizes the latency of marker-related operations against storage. The timeline server in Hudi serves file system and timeline views. As shown in the figure below, the new timeline-server-based marker mechanism delegates marker creation and other marker-related operations from the individual executors to the timeline server for centralized processing. The timeline server maintains the created markers in memory for the corresponding marker requests, and achieves consistency by periodically flushing the in-memory markers to a bounded number of underlying files in storage. In this way, even when the number of data files is large, the actual number of file operations and the latency related to markers are significantly reduced, improving write performance.
To improve the efficiency of handling marker creation requests, we batch marker requests on the timeline server. Each marker creation request is handled asynchronously in the Javalin timeline server and queued before processing. Every batch interval, for example 20 milliseconds, a dispatching thread pulls the pending requests from the queue and sends them to worker threads for processing. Each worker thread processes its marker creation requests and rewrites the underlying file that stores the markers. Multiple worker threads run concurrently; given that overwriting a file takes longer than the batch interval, each worker thread writes to an exclusive file not touched by other threads, guaranteeing consistency and correctness. Both the batch interval and the number of worker threads can be configured through write options.
Note that a worker thread always checks whether a marker has already been created by comparing the marker name in the request against the in-memory copy of all markers maintained on the timeline server. The underlying files that store the markers are read only upon the first marker request (lazy loading). The response to a request is returned only after the new markers have been flushed to the files, so that in case of a timeline server failure, the timeline server can recover the markers already created. These measures ensure consistency between storage and the in-memory copy and improve the performance of processing marker requests. A sketch of this batching scheme follows.
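The following Scala sketch illustrates the batching idea in isolation (it is not the actual Javalin handler code; names such as MarkerBatcher are made up): requests are queued, a scheduler drains the queue once per batch interval, and worker threads complete each batch.

```scala
import java.util.concurrent.{CompletableFuture, ConcurrentHashMap, ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.collection.mutable

// Illustrative sketch of marker-request batching: callers enqueue requests,
// a single scheduler drains the queue every batchIntervalMs, and a pool of
// worker threads records each batch and completes the pending futures.
class MarkerBatcher(numWorkers: Int, batchIntervalMs: Long) {
  private case class Request(markerName: String, promise: CompletableFuture[Boolean])

  private val queue = new ConcurrentLinkedQueue[Request]()
  private val created = ConcurrentHashMap.newKeySet[String]() // in-memory copy of created markers
  private val workers = Executors.newFixedThreadPool(numWorkers)
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  // Client-facing call: returns a future completed once the marker is recorded.
  def createMarker(markerName: String): CompletableFuture[Boolean] = {
    val promise = new CompletableFuture[Boolean]()
    if (created.contains(markerName)) promise.complete(true)
    else queue.add(Request(markerName, promise))
    promise
  }

  scheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = {
      val batch = mutable.Buffer[Request]()
      var r = queue.poll()
      while (r != null) { batch += r; r = queue.poll() }
      if (batch.nonEmpty) {
        workers.submit(new Runnable {
          override def run(): Unit = {
            // A real system would rewrite this worker's exclusive marker file here.
            batch.foreach { req => created.add(req.markerName); req.promise.complete(true) }
          }
        })
      }
    }
  }, batchIntervalMs, batchIntervalMs, TimeUnit.MILLISECONDS)

  def shutdown(): Unit = { scheduler.shutdown(); workers.shutdown() }
}
```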
5. Marker-related write options
In version 0.9.0 we introduce the following new marker-related write options to configure the marker mechanism. A usage sketch follows the list.
hoodie.write.markers.type: the marker type to use. Two modes are supported: direct, where a separate marker file corresponding to each data file is created directly by the writer, and timeline_server_based, where all marker operations are proxied through the timeline service. To improve efficiency, new marker entries are processed in batches and stored in a bounded number of underlying files. The default is direct.
hoodie.markers.timeline_server_based.batch.num_threads: the number of threads used on the timeline server to batch-process marker creation requests. The default value is 20.
hoodie.markers.timeline_server_based.batch.interval_ms: the batch interval in milliseconds for batched marker creation. The default value is 50.
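As a usage sketch (the table name, paths, and key fields are placeholders; a Spark job with the Hudi bundle on the classpath is assumed), these options can be passed like any other Hudi write option through the Spark datasource:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Sketch: enable timeline-server-based markers on a Hudi write.
object TimelineMarkersExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-markers-demo").getOrCreate()
    val df = spark.read.parquet("/tmp/input_data") // any source DataFrame

    df.write.format("hudi")
      .option("hoodie.table.name", "demo_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "partition")
      // marker-related options introduced in 0.9.0
      .option("hoodie.write.markers.type", "timeline_server_based")
      .option("hoodie.markers.timeline_server_based.batch.num_threads", "20")
      .option("hoodie.markers.timeline_server_based.batch.interval_ms", "50")
      .mode(SaveMode.Append)
      .save("/tmp/hudi/demo_table")

    spark.stop()
  }
}
```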
6. Performance
We evaluated the write performance of the direct and timeline_server_based marker mechanisms by bulk-inserting a large dataset using Amazon EMR, Spark, and S3. The input data is approximately 100GB. We configured the write to produce a large number of data files by setting the maximum parquet file size to 1MB and the parallelism to 240. As mentioned earlier, the latency of the direct marker mechanism is acceptable for small incremental writes, but the overhead grows sharply for bulk inserts / writes that produce many data files. A sketch of how such a setup could be configured appears below.
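The following snippet is a rough sketch of the benchmark write configuration; the config keys are assumed from Hudi's standard write options and are not quoted from this article.

```scala
// Rough sketch of the benchmark setup (assumed standard Hudi config keys).
val benchmarkOpts: Map[String, String] = Map(
  "hoodie.datasource.write.operation"     -> "bulk_insert",
  "hoodie.parquet.max.file.size"          -> (1024 * 1024).toString, // ~1 MB to force many data files
  "hoodie.bulkinsert.shuffle.parallelism" -> "240",
  "hoodie.write.markers.type"             -> "timeline_server_based" // or "direct" for the baseline
)
// Pass via df.write.format("hudi").options(benchmarkOpts)... as in the earlier example.
```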
As shown in the figure below, because markers are batched, the timeline-server-based marker mechanism generates far fewer files for storing markers, which greatly reduces the time spent on marker-related I/O operations and results in a 31% reduction in write completion time compared with the direct marker file mechanism.