
How to understand the optimization scheme of MaxCompute small file problem

2025-03-26 Update From: SLTechnology News&Howtos


Shulou(Shulou.com)06/01 Report--

This article explains how to understand the optimization schemes for the MaxCompute small file problem. The content is quite detailed; interested readers can refer to it, and we hope it is helpful to you.

Small file definition

Distributed file systems store data in blocks (Block). Files smaller than the block size (the default block size is 64 MB) are called small files.

How to judge the problem of a large number of small files

View the number of files

desc extended <table_name>;


A criterion for judging the large number of small files

1. Non-partitioned table: the number of files reaches 1,000 and the average file size is less than 64 MB.

2. Partitioned table: a) the number of files in a single partition reaches 1,000 and the average file size is less than 64 MB;

b) the total number of partitions in the partitioned table reaches 50,000 (the system limit is 60,000).
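The criteria above can be applied directly to the file count and total physical size reported by desc extended. A minimal sketch (not an official MaxCompute API) of the rule of thumb:

```python
# Rule of thumb from the criteria above: a small-file problem exists when the
# file count (per table, or per partition) reaches 1,000 and the average file
# is under the 64 MB block size.

SIXTY_FOUR_MB = 64 * 1024 * 1024

def has_small_file_problem(file_count: int, total_bytes: int) -> bool:
    """True when file count reaches 1,000 and the average file is under 64 MB."""
    if file_count < 1000:
        return False
    return total_bytes / file_count < SIXTY_FOUR_MB
```

For example, a table with 2,000 files averaging 1 MB each is flagged, while a table with 500 files is not, regardless of size.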

The main reason for the large number of small files

1. Unreasonable table design leads to too many partitions, for example partitioning by business unit, day, and hour: with 6 business units (BU), the number of partitions reaches 6 × 365 × 24 = 52,560 in a year.

2. When uploading data with integration tools such as Tunnel, Datahub, or the Console, frequent commits and unreasonable writes to tables (or table partitions) result in hundreds or thousands of files per partition, most of them small files of only a few KB.

3. When using insert into to write data, only a few rows are written per statement, and writes happen frequently.

4. Too many small files are generated in the Reduce process.

5. Various temporary files are generated during the execution of Job, and too many expired files are retained by the Recycle Bin.

Note: although the MaxCompute system automatically merges small files on its side, causes 1, 2, and 3 can be avoided by adopting a reasonable table partition design and upload method.
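The partition arithmetic behind cause 1 is worth checking explicitly. The numbers below come from the example above (6 business units, hourly partitions); dropping the hour level is one way to keep the count well under the system limit:

```python
# Back-of-the-envelope check of cause 1: partitioning by business unit (BU),
# day, and hour, with 6 BUs as in the example above.

bus = 6
days_per_year = 365
hours_per_day = 24

partitions_per_year = bus * days_per_year * hours_per_day
print(partitions_per_year)  # 52560 -- close to the 60,000 system limit

# Partitioning by BU and day only keeps the count manageable:
daily_only = bus * days_per_year
print(daily_only)  # 2190
```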

The impact of too many small files

MaxCompute processes a single large file more efficiently than many small files. Too many small files degrade overall execution performance, put pressure on the file system, and reduce effective use of storage space. MaxCompute limits the number of small files a single Fuxi instance can process to 120, so too many files increases the number of Fuxi instances required and hurts overall performance.

Merge small files command:

set odps.merge.max.filenumber.per.job=50000; --the default value is 50000; when the number of partitions exceeds 50,000 it needs to be raised, up to a maximum of 1,000,000.

ALTER TABLE <table_name> [PARTITION (...)] MERGE SMALLFILES;

How to merge small files

Partitioned table:

If your table is already a partitioned table, check whether your partition key is convergent; too many partitions also affect computing performance. Partitioning by date is recommended.

1. Execute commands to merge small files on a regular basis

2. If the partition is built by date, the partition data of the previous day can be overwritten with insert overwrite every day.

For example:

insert overwrite table tableA partition (ds='20181220') select * from tableA where ds='20181220';

Non-partitioned table:

If your table is a non-partitioned table, you can periodically execute the merge small file command to optimize the small file problem, but it is strongly recommended that you design a partitioned table:

1. Create a new partition table first. It is recommended to partition by date and set the life cycle reasonably to facilitate historical data recovery.

2. Import the data from the original non-partitioned table into the new partitioned table; (it is recommended to suspend the real-time writing business of the original non-partitioned table)

For example:

create table sale_detail_partition like sale_detail;
alter table sale_detail_partition add partition (sale_date='20181220', region='china');
insert overwrite table sale_detail_partition partition (sale_date='20181220', region='china') select * from sale_detail;

3. Modify the upstream and downstream business: the storage program is changed to write the new partition table, and the query operation is changed to query from the new partition table.

4. After the new partition table completes data migration and verification, delete the original partition table.

Note: if you use insert overwrite to rewrite full data and merge small files, be careful not to have both insert overwrite and insert into at the same time, otherwise there is a risk of data loss.

How to avoid generating small files

Optimize table design

Design table partitions reasonably, keeping the partition key convergent or manageable as far as possible; too many partitions also affect computing performance. It is recommended to partition by date and to set the table lifecycle reasonably, which makes historical data recoverable and also controls your storage cost.

Avoid generating small files with data integration tools

1. Tunnel- > MaxCompute

Avoid frequent commits when uploading data with Tunnel, and try to ensure that each committed DataSize is greater than 64 MB. Please refer to "Best practices and FAQs of the offline batch data channel Tunnel".
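The batching pattern this implies can be sketched as follows. This is an illustrative sketch only: BatchedUploader and commit_fn are hypothetical stand-ins for a Tunnel upload session, not part of the Tunnel SDK. The point is to accumulate records locally and commit only once the pending batch reaches about 64 MB, instead of committing every few records:

```python
# Hypothetical sketch of the "commit rarely, commit big" pattern. Each call to
# commit_fn stands in for one Tunnel commit, which produces one file server-side,
# so fewer and larger commits mean fewer and larger files.

SIXTY_FOUR_MB = 64 * 1024 * 1024

class BatchedUploader:
    def __init__(self, commit_fn, threshold=SIXTY_FOUR_MB):
        self.commit_fn = commit_fn      # hypothetical: wraps a real upload-session commit
        self.threshold = threshold
        self.buffer = []
        self.pending_bytes = 0

    def write(self, record: bytes):
        """Buffer a record; commit automatically once the threshold is reached."""
        self.buffer.append(record)
        self.pending_bytes += len(record)
        if self.pending_bytes >= self.threshold:
            self.flush()

    def flush(self):
        """Commit whatever is buffered (call once at the end of the upload)."""
        if self.buffer:
            self.commit_fn(self.buffer)
            self.buffer = []
            self.pending_bytes = 0
```

With this wrapper, a stream of tiny records still yields roughly one committed file per 64 MB of data rather than one file per write.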

2. Datahub- > MaxCompute

If Datahub is producing small files, it is recommended to apply for shards reasonably and to merge shards according to the topic's throughput to reduce the shard count. You can observe changes in data flow from the topic's throughput and adjust the write interval appropriately.

Strategy for choosing the number of Datahub shards (applying for too many shards causes small file problems):

1) The default throughput of a single shard is 1 MB/s; allocate the actual number of shards accordingly (a few extra can be added on top of this).

2) The logic of synchronizing to MaxCompute is that each shard has a separate task, which commits every 5 minutes or every 64 MB. The 5-minute default exists so that data appears in MaxCompute as soon as possible. If partitions are built hourly, each shard produces 12 files per hour. If the data volume is small but there are many shards, MaxCompute ends up with many small files (shards × 12 per hour). So do not allocate too many shards; allocate as needed.

Reference suggestion: if traffic is 5 MB/s, 5 shards would match capacity; leaving a 20% buffer against traffic peaks, apply for 6 shards.
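The sizing rule and the files-per-hour arithmetic above can be written down as a small sketch (the 1 MB/s capacity, 20% buffer, and 5-minute commit interval are the figures stated above):

```python
import math

# Sizing rule from above: one shard handles ~1 MB/s; add ~20% headroom.
def shards_needed(throughput_mb_per_s: float, buffer: float = 0.2,
                  shard_capacity_mb_per_s: float = 1.0) -> int:
    return math.ceil(throughput_mb_per_s * (1 + buffer) / shard_capacity_mb_per_s)

# Files produced per hour when each shard commits every 5 minutes.
def files_per_hour(shards: int, commit_interval_min: int = 5) -> int:
    return shards * (60 // commit_interval_min)

print(shards_needed(5))   # 6
print(files_per_hour(6))  # 72
```

This makes the trade-off concrete: 6 shards already produce 72 files per hour, so every unnecessary shard adds another 12 small files per hour.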

3. DataX- > MaxCompute

Because DataX also wraps the Tunnel SDK to write to MaxCompute, when configuring ODPSWriter do not set the blockSizeInMB parameter too small; preferably 64 MB or above.

That is all on how to understand the optimization schemes for the MaxCompute small file problem. We hope the above content is helpful and lets you learn more. If you found this article useful, feel free to share it with others.
