
How to control the number of map and reduce tasks in Hive


This article mainly shows "how to control the number of map and reduce tasks in Hive". The content is simple, easy to understand, and clearly organized; I hope it helps resolve your doubts as you work through it.

1. Controlling the number of map tasks in Hive:

1. Typically, a job will generate one or more map tasks via the input directory.

The main determining factors are: the total number of input files, the size of the input files, and the file block size configured for the cluster (currently 128M, which can be viewed in Hive with the command set dfs.block.size; this parameter cannot be customized);
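As a minimal check, the current value can be printed from the hive CLI (the figure depends on your cluster; 134217728 bytes corresponds to 128M):

set dfs.block.size;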

2. Examples:

a) Suppose there is a file a in the input directory with a size of 780M, then hadoop will divide the file a into 7 blocks (6 blocks of 128m and 1 block of 12m), resulting in 7 map numbers.

b) Suppose there are 3 files a,b,c in the input directory, the size is 10m, 20m, 130m respectively, then hadoop will be divided into 4 blocks (10m,20m,128m,2m), thus generating 4 map numbers.

That is, if the file is larger than the block size (128m), it is split, and if it is smaller than the block size, the file is treated as a block.

3. Is it better to have more maps?

The answer is no. If a task has many small files (much smaller than the 128m block size), each small file is also treated as a block and handled by its own map task.

In that case the startup and initialization time of each map task far exceeds its actual processing time, which wastes a great deal of resources.

Moreover, the number of maps that can execute simultaneously is limited.

4. Does it follow that, as long as each map handles a block of close to 128m, there is nothing to worry about?

Not necessarily. For example, a 127m file would normally be handled by a single map, but suppose the file has only one or two small fields yet tens of millions of records.

If the map processing logic is at all complex, doing the work with a single map task is bound to be time-consuming.

For problems 3 and 4 above, we need two approaches: reducing the number of maps and increasing the number of maps;

How to merge small files and reduce the number of maps?

Consider an SQL task:

Select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';

The input directory for this task is /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04.

There are 194 files in total, many of which are much smaller than 128m, with a total size of 9G. Normal execution will use 194 map tasks.
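As a side note, one way to confirm this kind of file layout is to list the partition directory from the hive CLI (a minimal sketch, using the path above; the dfs command simply forwards the listing to HDFS):

dfs -ls /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04;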

Map Total computing resources consumed: SLOTS_MILLIS_MAPS= 623,020

I reduced the number of maps by merging the small files before the map phase, using the following settings:

set mapred.max.split.size=100000000;

set mapred.min.split.size.per.node=100000000;

set mapred.min.split.size.per.rack=100000000;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Executing the above statement takes 74 map tasks, and the computational resources consumed by map are: SLOTS_MILLIS_MAPS= 333,500

For this simple SQL task, it may take about the same amount of time to execute, but it saves half the computational resources.

To explain roughly: 100000000 means about 100M, and set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files before execution.

The first three parameters determine the size of the merged file blocks: portions larger than the 128m block size are split at 128m, portions between 100m and 128m are split at 100m, and whatever is smaller than 100m (both small files and the leftovers from splitting large files) is merged together,

and eventually 74 blocks were generated.
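Putting it together, a minimal sketch of the whole sequence as it would be run in a single hive session (the same settings and query as above):

set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';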

How to increase the number of maps appropriately?

When the input files are large, the task logic is complex, and the map execution is very slow, you can consider increasing the number of maps to reduce the amount of data processed by each map, thus improving the task execution efficiency.

Suppose there is a task like this:

Select data_desc,
count(1),
count(distinct id),
sum(case when ...),
sum(case when ...),
sum(...)
from a group by data_desc

If table a has only one file of 120M but it contains tens of millions of records, completing this task with a single map is bound to be time-consuming. In this case, we should consider splitting this one file into several files in a reasonable way,

so that the work can be completed with multiple map tasks.

set mapred.reduce.tasks=10;

create table a_1 as

select * from a

distribute by rand(123);

In this way, the records of table a are randomly scattered into table a_1, which consists of 10 files; replacing table a with a_1 in the SQL then makes the job use 10 map tasks.

Each map task handles more than 12 megabytes (millions of records) of data, which is certainly much more efficient.
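For completeness, a minimal sketch of the follow-up step described above: the original aggregation is simply pointed at a_1 (the remaining sum(case when ...) aggregates are elided here, just as in the original query):

select data_desc,
count(1),
count(distinct id)
from a_1
group by data_desc;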

These two approaches may look contradictory: one merges small files, the other splits a large file into smaller pieces. This is exactly the point that needs attention.

Depending on the actual situation, two principles should guide the number of maps: use an appropriate number of maps for a large volume of data, and have each map task process an appropriate amount of data.

2. Controlling the number of reduce tasks in Hive:

1. How Hive determines the number of reduces on its own:

The number of reduces greatly affects task execution efficiency. If it is not specified, Hive estimates a number of reduces based on the following two settings:

hive.exec.reducers.bytes.per.reducer (amount of data processed per reduce task, default is 1000^3=1G)

hive.exec.reducers.max (maximum number of reducers per task, default 999)

The formula for calculating the number of reducers is simple: N = min(parameter 2, total input size / parameter 1)

That is, if the total size of reduce inputs (map outputs) does not exceed 1G, then there will be only one reduce task.

Select pt,count(1) from opt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;

The total size of /group/p_sdo_data/p_sdo_data_etl/pt/opt_tbaccountcopy_mes/pt=2012-07-04 is a bit more than 9G, so this statement uses 10 reduce tasks.
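Plugging this example into the formula above (the input size is only given as more than 9G, so the arithmetic is approximate; Hive rounds the quotient up):

N = min(999, just over 9G / 1G) = 10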

2. Adjusting the number of reduces, method 1:

Adjust the value of the hive.exec.reducers.bytes.per.reducer parameter;

set hive.exec.reducers.bytes.per.reducer=500000000; (500M)

select pt,count(1) from opt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 20 reduce tasks (the 9G-plus input divided by 500M per reducer works out to roughly 20)

3. Adjusting the number of reduces, method 2:

set mapred.reduce.tasks = 15;

select pt,count(1) from opt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 15 reduce tasks

4. More reduces are not always better;

As with map, starting and initializing reduce consumes time and resources;

In addition, there will be as many output files as there are reduces. If many small files are generated and they become the input of the next task, that task will in turn suffer from too many small files.

5. Cases where there is only one reduce;

You will often find that no matter how much data the task has, and no matter how you set the parameters to adjust the number of reduces, the task always ends up with a single reduce task.

Besides the case where the amount of data is smaller than the hive.exec.reducers.bytes.per.reducer value, a single reduce task also occurs for the following reasons:

a) Aggregation without a group by, for example writing select pt,count(1) from opt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from opt_tbaccountcopy_mes where pt = '2012-07-04';

This is very common, and I hope you rewrite such queries whenever possible.

b) Order by

c) Cartesian products

In these cases, apart from finding ways to work around and avoid them, I have no good solution for now, because these operations are all global, so Hadoop has to complete them with a single reduce;
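As a hedged illustration of case b), the query below is hypothetical and simply reuses the table from the earlier examples; the global order by at the end forces the final sort through a single reduce stage. A Cartesian product, e.g. a join written without any join key, behaves the same way.

select pt,count(1) as cnt from opt_tbaccountcopy_mes group by pt order by cnt;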

Similarly, when setting the number of reduces, these two principles also need to be considered: use an appropriate number of reduces for a large volume of data, and have each reduce task process an appropriate amount of data.

That's all for the article "How to control the number of map and reduce tasks in Hive". Thank you for reading! I hope the content shared here helps everyone; if you want to learn more, you are welcome to follow the industry information channel!
