Many newcomers are not clear about why small files in Spark need to be merged. To help answer that question, this article walks through a real case in detail; I hope those who need it will find something useful in it.
We know that most of Spark's computation happens in memory, so Spark's bottlenecks generally come from the cluster's resource constraints (standalone, YARN, Mesos, K8s): CPU, network bandwidth, and memory. To make Spark fast, you have to make full use of system resources, especially memory and CPU. Sometimes we also need to make optimization adjustments that reduce overhead, such as merging small files.
I. Problem Phenomenon
We have a table, bi.dwd_tbl_conf_info, with 150,000 rows totaling 133 MB; querying it with SELECT * FROM bi.dwd_tbl_conf_info takes 3 minutes. Another table, ods_tbl_conf_detail, has 5 million rows totaling 6.3 GB, yet querying it takes only 23 seconds. Both are column-stored tables.
The big table queries fast while the small table queries slow. Why does such a strange phenomenon occur?
II. Problem Investigation
A query on the 6.3 GB table takes 23 seconds, whereas a query on the 133 MB table takes 3 minutes, which is very strange. We collected the table creation statements of both and found little difference between them: most columns are STRING, and the two tables have a similar number of columns.
CREATE TABLE IF NOT EXISTS `bi`.`dwd_tbl_conf_info` (
    `corp_id` STRING COMMENT '',
    `dept_uuid` STRING COMMENT '',
    `user_id` STRING COMMENT '',
    `user_name` STRING COMMENT '',
    `uuid` STRING COMMENT '',
    `dtime` DATE COMMENT '',
    `slice_number` INT COMMENT '',
    `attendee_count` INT COMMENT '',
    `mr_id` STRING COMMENT '',
    `mr_pkg_id` STRING COMMENT '',
    `mr_parties` INT COMMENT '',
    `is_mr` TINYINT COMMENT 'R',
    `is_live_conf` TINYINT COMMENT ''
);

CREATE TABLE IF NOT EXISTS `bi`.`ods_tbl_conf_detail` (
    `id` string,
    `conf_uuid` string,
    `conf_id` string,
    `name` string,
    `number` string,
    `device_type` string,
    `j_time` bigint,
    `l_time` bigint,
    `media_type` string,
    `dept_name` string,
    `UPDATETIME` bigint,
    `CREATETIME` bigint,
    `user_id` string,
    `USERAGENT` string,
    `corp_id` string,
    `account` string
);
Because both queries are simple SELECT statements, with no complex aggregations or joins and no UDF-related operations, we were fairly sure the slowdown occurred while reading the tables, so we focused our suspicion on the table-read stage. By examining the DAG and task distribution of the two queries, we found the difference.
The fast table's query ran 68 tasks in total, with an even task distribution and an average of 7~9 s per task, while the slow table's query ran 1160 tasks in total, also averaging about 9 s per task.
At this point, we had basically found where the problem lay. The big table is 6.3 GB but consists of only 68 files, so it reads quickly. The small table is only 133 MB, yet it is split across a huge number of files, which produces a huge number of tasks; since each individual task finishes quickly, most of the time goes to task scheduling, and the overall query takes much longer.
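To confirm how many underlying files back each table, one option (a sketch only; the article does not show how the file counts were obtained) is to count distinct file paths with Spark SQL's built-in input_file_name() function:

-- Sketch: count the distinct data files backing a table.
-- Using input_file_name() here is our assumption, not a step shown
-- in the original investigation.
SELECT COUNT(DISTINCT input_file_name()) AS file_count
FROM bi.dwd_tbl_conf_info;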
So how do we solve the small table's slow-query problem?
III. Business Optimization
That left the following questions in front of us:
Why does the small table produce so many small files?
How do we merge the small files that have already been generated?
With these two questions in mind, we talked to the business developers and made a discovery: the small table is populated by the developers themselves, who query and clean data from the original source tables for different time slices and then insert the results into the small table. Because the time slices are short, these insertions happen very frequently, which produces a large number of small files.
So we needed to solve two problems: how to merge the existing historical small files, and how to make sure the subsequent business process stops producing them. We guided the business developers through the following optimizations:
Merge the historical data with INSERT OVERWRITE bi.dwd_tbl_conf_info SELECT * FROM bi.dwd_tbl_conf_info. Because DLI provides data consistency protection, the OVERWRITE does not affect reads and queries of the original data while it runs, and the newly merged data is used once the OVERWRITE completes. After the merge, the full-table query time dropped from 3 minutes to 9 seconds.
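As a sketch, the merge can be written as below. The COALESCE hint is our own addition for illustration (it exists in newer Spark SQL releases to cap the number of output files; whether DLI supports it is an assumption), and the target file count is only an example; the article uses the plain INSERT OVERWRITE ... SELECT * form.

-- Rewrite the table onto itself to merge its small files.
-- The COALESCE hint and the single-file target are illustrative assumptions;
-- one file is plausible here since the whole table is only ~133 MB.
INSERT OVERWRITE TABLE bi.dwd_tbl_conf_info
SELECT /*+ COALESCE(1) */ *
FROM bi.dwd_tbl_conf_info;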
Change the original table into a partitioned table, inserting data for different time ranges into different partitions. Queries then only read the partition data for the required time period, further reducing the amount of data scanned.
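As a sketch of the second change, assuming the dtime column from the schema above is chosen as the partition key and using a hypothetical table name, source, and dates (the article names none of these), the partitioned variant could look like:

-- Hypothetical partitioned variant of the table; the partition key (dtime),
-- the table name, the source table, and the dates are illustrative assumptions.
CREATE TABLE IF NOT EXISTS `bi`.`dwd_tbl_conf_info_p` (
    `corp_id` STRING,
    `dept_uuid` STRING,
    `user_id` STRING
    -- ... remaining non-partition columns as in the original schema
)
PARTITIONED BY (`dtime` DATE);

-- Each time slice is written into its own partition.
INSERT OVERWRITE TABLE bi.dwd_tbl_conf_info_p PARTITION (dtime = '2021-01-01')
SELECT corp_id, dept_uuid, user_id -- ..., remaining non-partition columns
FROM bi.dwd_tbl_conf_info          -- placeholder source table
WHERE dtime = '2021-01-01';

-- Queries then read only the partitions for the required time period.
SELECT * FROM bi.dwd_tbl_conf_info_p
WHERE dtime BETWEEN '2021-01-01' AND '2021-01-07';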