
What are the best tuning practices for a data lake?


In this issue, the editor brings you the best tuning practices for a data lake. The article is rich in content and analyzes the topic from a professional point of view. I hope you will get something out of it after reading.

1. Select an appropriate partition column

It is recommended that you specify a partition column for the Delta table. The most common partition column in enterprises is date, followed by geographic region. Follow these two rules of thumb to decide which column to partition by:

a. If the column has high cardinality, do not use it for partitioning. For example, if you partition by the userId column, the number of partitions may equal the total number of users, which is obviously not a good partitioning strategy.

b. Consider the amount of data in each partition: partition by a column only if you expect each partition to hold at least 1 GB of data (see the sketch below).
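For illustration, here is a minimal sketch of writing a Delta table partitioned by a date column. The names spark (a SparkSession), df (a DataFrame with a date column), and the path /delta/events are assumptions for this example, not from the article:

// Assumed: spark is a SparkSession, df is a DataFrame with a `date` column.
// Partitioning by `date` follows rule (a): low cardinality, and rule (b):
// each day's partition is expected to hold at least ~1 GB of data.
df.write
  .format("delta")
  .partitionBy("date")
  .save("/delta/events")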

2. Compact files

If you continually write data to a Delta table, a large number of files will accumulate over time, especially if data arrives in small batches. This can significantly slow down queries against the table and may also hurt file system performance. Ideally, a large number of small files should be periodically rewritten into a small number of larger files.

You can compact the table by repartitioning it into a smaller number of files. In addition, you can set the dataChange option to false to indicate that the operation does not change the data itself, but only rearranges the data layout. This ensures that the compaction has minimal impact on other concurrent operations.

For example, you can compact a table into 16 files:

val path = "..."
val numFiles = 16

spark.read
  .format("delta")
  .load(path)
  .repartition(numFiles)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .save(path)

If the table is partitioned and you want to repartition only a single partition, you can read just that partition with a where predicate and write it back using replaceWhere:

val path = "..."
val partition = "year = '2019'"
val numFilesPerPartition = 16

spark.read
  .format("delta")
  .load(path)
  .where(partition)
  .repartition(numFilesPerPartition)
  .write
  .option("dataChange", "false")
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", partition)
  .save(path)

Warning:

Setting dataChange = false on an operation that does change the data may corrupt the data in the table.

3. Performance tuning of merge operations

The following methods can effectively reduce the processing time of merge:

a. Reduce the amount of data scanned for matches

By default, the merge operation scans the entire Delta Lake table for data that meets the match condition. You can add predicates to reduce the amount of data scanned. For example, suppose the data is partitioned by country and date, and you only want to update yesterday's data for a specific country. You can add conditions such as:

events.date = current_date() AND events.country = 'USA'

In this way, only the data in the specified partitions is processed, which greatly reduces the amount of data scanned. It also avoids some conflicts between concurrent operations on different partitions.
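As a hedged sketch of how such a predicate fits into the merge condition (the table path /delta/events, the eventId join key, and the updatesDF source DataFrame are assumptions for illustration):

import io.delta.tables.DeltaTable

// Assumed: spark is a SparkSession, updatesDF is the source DataFrame,
// and the target Delta table lives at /delta/events.
val events = DeltaTable.forPath(spark, "/delta/events")

events.as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId " +
      "AND events.date = current_date() AND events.country = 'USA'")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()

Because the extra predicates pin down the partitions, merge only has to scan the files under the matching date and country values rather than the whole table.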

b. Compact small files

If the data is stored in many small files, reading it will be slow. You can compact the small files into larger ones to improve read speed, as described in the file compaction section above.

c. Control the number of shuffle partitions

To compute and update the data, the merge operation shuffles the data multiple times. The number of tasks in each shuffle is set by the parameter spark.sql.shuffle.partitions, which defaults to 200. This parameter controls not only the shuffle parallelism but also the number of output files. Increasing this value raises the degree of parallelism, but it also increases the number of small files produced.
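For example, a minimal sketch of lowering the shuffle parallelism before running a merge (the value 50 is an arbitrary illustration; the right number depends on data volume and cluster size):

// Fewer shuffle partitions mean fewer, larger output files,
// at the cost of reduced parallelism.
spark.conf.set("spark.sql.shuffle.partitions", "50")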

d. Repartition data before writing

For partitioned tables, the merge operation can produce far more small files than there are shuffle partitions. The reason is that each shuffle task may write files into multiple partitions of the table, which can become a performance bottleneck. Therefore, in many scenarios it is effective to repartition the data by the table's partition columns before writing. You can enable this by setting spark.delta.merge.repartitionBeforeWrite to true.
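A sketch of enabling this behavior, using the configuration key as quoted in this article (the exact key name may differ across Delta Lake versions):

// Repartition merge output by the table's partition columns before writing.
spark.conf.set("spark.delta.merge.repartitionBeforeWrite", "true")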

The above are the data lake tuning best practices shared by the editor. If you happen to have similar questions, you may refer to the analysis above. If you want to know more, you are welcome to follow the industry information channel.
