Many newcomers are not clear about how Apache Hudi intelligently handles the small file problem. To help, this article explains the mechanism in detail; readers who need this background can follow along and hopefully come away with something useful.
1. Introduction
Apache Hudi is a popular open source data lake framework, and one very important feature Hudi provides is automatic file size management that requires no user intervention. A large number of small files leads to poor query performance, because the query engine has to open, read, and close too many files while executing a query. In streaming scenarios data is ingested continuously, and without any handling this produces a large number of small files.
2. At write time vs. after write
A common approach is to first write many small files and then merge them into large files; this solves the scalability problems that small files cause for the system, but while too many small files remain exposed, the query SLA may not be met. For Hudi tables, this post-write merging can be done easily with the Clustering feature provided by Hudi; for more details, refer to the earlier article "Reduce query time by 60%! A look at Apache Hudi's data layout tricks".
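As a rough illustration of this post-write approach, the snippet below shows the kind of inline clustering options Hudi exposes for rewriting small files after ingestion. The option keys are Hudi clustering configs; the table name, path, and threshold values are only illustrative assumptions, not tuned recommendations.

```python
# A minimal sketch of the "write first, compact later" approach via inline clustering.
# Option keys are Hudi clustering configs; the table name, path and thresholds
# below are illustrative assumptions.
clustering_opts = {
    "hoodie.table.name": "demo_table",
    "hoodie.datasource.write.operation": "insert",
    "hoodie.clustering.inline": "true",            # run clustering as part of the write pipeline
    "hoodie.clustering.inline.max.commits": "4",   # plan a clustering run every 4 commits
    # files below this size are candidates for rewriting:
    "hoodie.clustering.plan.strategy.small.file.limit": str(100 * 1024 * 1024),
    # target size of the rewritten files:
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(120 * 1024 * 1024),
}

# Usage (assuming `df` is a Spark DataFrame and the Hudi Spark bundle is on the classpath):
# df.write.format("hudi").options(**clustering_opts).mode("append").save("/tmp/hudi/demo_table")
```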
This article covers the other strategy: Hudi's file size optimization at write time. Hudi manages file sizes itself so that small files are never exposed to the query engine, and automatic file size handling plays a key role in that.
When performing insert/upsert operations, Hudi can keep files at the specified target size (note: this feature does not apply to bulk_insert, which is mainly intended as a faster replacement for spark.write.parquet when loading data into Hudi).
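For reference, the write operation is selected through the hoodie.datasource.write.operation option; a minimal fragment (the choice shown is just an example):

```python
# Automatic file sizing applies to insert/upsert writes; bulk_insert skips it in
# exchange for raw throughput (it is meant as a faster stand-in for spark.write.parquet).
write_op = "upsert"          # or "insert": both get automatic file sizing
# write_op = "bulk_insert"   # no small-file handling on this path

opts = {"hoodie.datasource.write.operation": write_op}
```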
3. Configuration
We use a COPY_ON_WRITE table to demonstrate how Hudi automatically handles file sizes.
The key configuration items are as follows:
- hoodie.parquet.max.file.size [1]: the maximum size of a data file; Hudi will try to keep files at this size.
- hoodie.parquet.small.file.limit [2]: files smaller than this size are considered small files.
- hoodie.copyonwrite.insert.split.size [3]: the number of records inserted into a single new file; this should roughly match the number of records a full-size file holds (it can be estimated from the maximum file size and the average record size).
For example, if the first option is set to 120MB and the second to 100MB, any file smaller than 100MB will be regarded as a small file. To turn this feature off, set hoodie.parquet.small.file.limit to 0.
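A minimal PySpark sketch of passing these options on a write, assuming the Hudi Spark bundle is on the classpath; the table name, schema, path, and concrete values below are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SparkSession

# Requires the Hudi Spark bundle on the classpath; table name, schema and path are made up.
spark = SparkSession.builder.appName("hudi-file-sizing-sketch").getOrCreate()

# Toy data; in practice this would be your incoming batch or micro-batch.
df = spark.createDataFrame(
    [("id1", "2024-01-01", 10), ("id2", "2024-01-01", 20)],
    ["uuid", "ds", "value"],
)

hudi_opts = {
    "hoodie.table.name": "demo_table",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "ds",
    "hoodie.datasource.write.precombine.field": "value",
    # File-size management knobs discussed above (byte values):
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # target maximum data file size
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # below this, a file counts as small; 0 disables the feature
    "hoodie.copyonwrite.insert.split.size": "500000",           # records per new file until size history exists
}

(
    df.write.format("hudi")
    .options(**hudi_opts)
    .mode("append")
    .save("/tmp/hudi/demo_table")
)
```

With settings like these, later upserts into the same partition keep topping up files below 100MB toward the 120MB target, as the example in the next section shows.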
4. Example
Suppose the data files under a given partition are laid out as described below.
Assume hoodie.parquet.max.file.size is configured to 120MB and hoodie.parquet.small.file.limit to 100MB, and that File_1 is 40MB, File_2 is 80MB, File_3 is 90MB, File_4 is 130MB, and File_5 is 105MB. When new data is written, the process is as follows:
Step 1: assign updates to the files that contain the affected records; this step consults the index to locate the corresponding files. Assuming updates increase file size, the files grow; when updates shrink file size (for example, by invalidating many fields), subsequent writes will make the files smaller and smaller.
Step 2: determine the small files in each partition according to hoodie.parquet.small.file.limit. In our example the limit is 100MB, so the small files are File_1, File_2, and File_3.
Step 3: once the small files are identified, newly inserted records are assigned to them so that each can reach 120MB: File_1 is given 80MB worth of records, File_2 is given 40MB, and File_3 is given 30MB.
Step 4: after all the small files have been assigned their share of insert records, any remaining insert records are assigned to newly created FileGroups / data files. The number of records per new data file is determined by hoodie.copyonwrite.insert.split.size (or, once earlier writes exist, by the average record size computed from previous writes together with the configured maximum file size). Assuming this works out to 120K records (with each record about 1KB), if 300K records remain, 3 new files (File_6, File_7, File_8) will be created: File_6 and File_7 each receive 120K records, and File_8 receives 60K records (about 60MB). In later writes File_8 will again be treated as a small file, so more data can be inserted into it.
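To make steps 2 to 4 concrete, here is a simplified, self-contained Python model of the allocation (not Hudi's actual implementation): existing small files are topped up toward the maximum size, and leftover records spill into new file groups. The incoming record count is a made-up figure chosen so that 300K records remain after step 3, matching the example above.

```python
# Simplified illustration of steps 2-4; not Hudi's real code.
MAX_FILE_SIZE_MB = 120
SMALL_FILE_LIMIT_MB = 100
RECORD_SIZE_MB = 1 / 1024          # assume ~1KB per record, as in the example
INSERT_SPLIT_SIZE = 120_000        # records per newly created file

existing_files_mb = {"File_1": 40, "File_2": 80, "File_3": 90, "File_4": 130, "File_5": 105}
incoming_records = 453_600         # chosen so 300K records are left over after step 3

plan = {}

# Step 2 + 3: top up files below the small-file limit until they reach the max size.
for name, size_mb in existing_files_mb.items():
    if size_mb < SMALL_FILE_LIMIT_MB and incoming_records > 0:
        capacity = int((MAX_FILE_SIZE_MB - size_mb) / RECORD_SIZE_MB)
        assigned = min(capacity, incoming_records)
        plan[name] = assigned
        incoming_records -= assigned

# Step 4: any remaining records go to newly created file groups.
next_idx = len(existing_files_mb) + 1
while incoming_records > 0:
    assigned = min(INSERT_SPLIT_SIZE, incoming_records)
    plan[f"File_{next_idx}"] = assigned
    incoming_records -= assigned
    next_idx += 1

print(plan)
# {'File_1': 81920, 'File_2': 40960, 'File_3': 30720,
#  'File_6': 120000, 'File_7': 120000, 'File_8': 60000}
```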
Hudi executes the above algorithm using mechanisms such as a custom partitioner to optimize how records are assigned to files. After this round of writes completes, all files except File_8 have been brought close to the optimal size; every write follows this process, ensuring the Hudi table never accumulates small files.