Example Analysis of data skew in Hive performance tuning

This article offers a sample analysis of data skew in Hive performance tuning. It is shared here as a practical reference; follow along for the details.

Number of map tasks

Typically, a job spawns one or more map tasks from its input directory. The main determining factors are the total number of input files, the size of each input file, and the block size configured on the cluster (currently 128 MB, which can be viewed in Hive with the set dfs.block.size; command; this parameter cannot be customized or modified from Hive).
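For reference, a minimal check from a Hive session using the command named above (on newer Hadoop versions the same value may also be exposed under the name dfs.blocksize):

-- print the HDFS block size visible to the current Hive session, in bytes
set dfs.block.size;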

For example:

a) A large file: if the input directory contains a file a of 780 MB, Hadoop splits it into 7 blocks (six 128 MB blocks and one 12 MB block), producing 7 map tasks.

b) Multiple small files: if the input directory contains three files of 10 MB, 20 MB, and 150 MB, Hadoop splits them into 4 blocks (10 MB, 20 MB, 128 MB, and 22 MB), producing 4 map tasks.

In other words, a file larger than the block size (128 MB) is split, while a file smaller than the block size is treated as a single block.

Is it always better to have more map tasks? No. If a task has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task, and a map task's startup and initialization time can far exceed its logical processing time, which wastes a great deal of resources. Moreover, the number of map tasks that can run simultaneously is limited.

Is it then enough to make sure each map handles a file block close to 128 MB? Not necessarily. For example, a 127 MB file would normally be handled by a single map, but if it has only one or two fields yet tens of millions of records, and the map logic is fairly complex, doing the work with one map task is bound to be time-consuming.

To address the two problems above, we need two approaches: reducing the number of map tasks (by merging small files, as sketched below) and increasing the number of map tasks (by splitting large files, covered in the next section).
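As a hedged sketch, the following settings are commonly used to combine small files into larger input splits before map tasks are launched (standard Hive/Hadoop property names; the byte values are illustrative rather than taken from this article):

-- combine small input files into larger splits so fewer map tasks are started
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- maximum size of a combined split, in bytes (illustrative: 128 MB)
set mapreduce.input.fileinputformat.split.maxsize=134217728;
-- per-node and per-rack minimums below which blocks keep being combined (illustrative: 100 MB)
set mapreduce.input.fileinputformat.split.minsize.per.node=104857600;
set mapreduce.input.fileinputformat.split.minsize.per.rack=104857600;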

How to increase the number of map tasks appropriately

When the input files are large, the task logic is complex, and map execution is very slow, we can consider increasing the number of map tasks so that each map processes less data, which improves the efficiency of task execution. For the case above, assume there is a task:

select data_desc,
       count(1),
       count(distinct id),
       sum(case when …),
       sum(case when …),
       sum(…)
from a
group by data_desc;

If table a has only one file of 120 MB but contains tens of millions of records, completing this task with a single map is bound to be time-consuming. In this case, we should consider splitting the file into several reasonably sized pieces so that multiple map tasks can be used:

set mapreduce.job.reduces=10;
create table a_1 as
select * from a
distribute by rand();

In this way, the records of table a are randomly distributed into table a_1, which consists of 10 files. Replacing table a with a_1 in the SQL above then gets the work done with 10 map tasks.
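The rewritten query simply reads from a_1 instead of a (the expressions elided in the original are kept as placeholders):

select data_desc,
       count(1),
       count(distinct id),
       sum(case when …),
       sum(case when …),
       sum(…)
from a_1
group by data_desc;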

Each map task then processes only about 12 MB of data (a few million records), which is certainly much more efficient.

The two approaches may seem contradictory: one merges small files, while the other splits a large file into smaller ones. This is precisely the point to pay attention to. Depending on the actual situation, controlling the number of map tasks should follow two principles: use an appropriate number of map tasks for a large volume of data, and have each individual map task process an appropriate amount of data.

Adjusting the number of reduce tasks

Method 1 for adjusting the number of reduce tasks

a) The amount of data processed by each reduce task defaults to 256 MB:

hive.exec.reducers.bytes.per.reducer=256000000

b) The maximum number of reduce tasks per job defaults to 1009:

hive.exec.reducers.max=1009

c) Formula for calculating the number of reducers:

N = min(parameter 2, total input data size / parameter 1)

Parameter 1: amount of data processed by each reduce task. Parameter 2: maximum number of reduce tasks per job.
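A worked example of the formula, using an illustrative input size that is not from the figures above: with parameter 1 = 256,000,000 bytes and parameter 2 = 1009, an input of 5,120,000,000 bytes (about 5 GB) gives

N = min(1009, 5,120,000,000 / 256,000,000) = min(1009, 20) = 20 reduce tasks.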

Method 2 for adjusting the number of reduce tasks

Set the number of reduce tasks per job in Hadoop's mapred-default.xml file, or set it for the current session:

set mapreduce.job.reduces=15;

More reduce tasks are not always better:

a) Starting and initializing too many reduce tasks also consumes time and resources. b) There will be as many output files as there are reduce tasks; if many small files are generated and then used as input for the next job, the problem of too many small files appears again.
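One common way to mitigate the small-output-file problem in b) is Hive's output-merge settings. The sketch below uses standard hive.merge properties; the threshold values are illustrative rather than taken from this article.

-- merge small files produced at the end of a map-reduce job
set hive.merge.mapredfiles=true;
-- merge small files produced by map-only jobs
set hive.merge.mapfiles=true;
-- target size of the merged files, in bytes (illustrative: 256 MB)
set hive.merge.size.per.task=256000000;
-- trigger the merge when the average output file size falls below this value (illustrative: 16 MB)
set hive.merge.smallfiles.avgsize=16000000;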

Thank you for reading! This concludes the article on "Example Analysis of data skew in Hive performance tuning". I hope the content above is of some help to you. If you found the article useful, please share it so more people can see it.
