
How to control the number of map tasks in Hive

2025-01-17 Update From: SLTechnology News&Howtos


Shulou (Shulou.com) 05/31 Report --

This article explains how to control the number of map tasks in Hive. It is meant as a practical reference; follow along for the details.

1. Typically, a job generates one or more map tasks from its input directory.

The main determining factors are the total number of files in the input, the size of those files, and the block size configured for the cluster (currently 128M; it can be inspected in Hive with the command set dfs.block.size;, but this parameter cannot be customized or modified from Hive).

2. For example:

a) Suppose the input directory contains one file a of size 780M. Hadoop splits file a into seven blocks (six 128M blocks and one 12M block), producing seven map tasks.

b) Suppose the input directory contains three files of sizes 10M, 20M, and 130M. Hadoop splits them into four blocks (10M, 20M, 128M, and 2M), producing four map tasks.

In other words, a file larger than the block size (128M) is split, while a file smaller than the block size is treated as a single block.
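The split arithmetic in examples a) and b) can be sketched in Python. This follows the article's simple rule exactly; a real InputFormat also applies a small slop factor when deciding whether to cut a final split, which this sketch ignores:

```python
def num_map_tasks(file_sizes_mb, block_mb=128):
    """Count map tasks under the simple rule: each file is cut into
    block_mb-sized pieces, and any remainder (or a file smaller than
    one block) becomes one more split of its own."""
    splits = 0
    for size in file_sizes_mb:
        full, rest = divmod(size, block_mb)
        splits += full + (1 if rest else 0)
    return splits

# Example a): one 780M file -> six 128M blocks + one 12M block
print(num_map_tasks([780]))          # 7
# Example b): files of 10M, 20M, 130M -> blocks of 10M, 20M, 128M, 2M
print(num_map_tasks([10, 20, 130]))  # 4
```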

3. Is it true that the more map tasks, the better?

The answer is no. If a task has many small files (much smaller than the 128M block size), each small file is treated as a block and handled by its own map task.

However, starting up and initializing a map task takes far longer than its actual logical processing, which wastes a great deal of resources.

Moreover, the number of map tasks that can run simultaneously is limited.

4. If every map task handles a file block close to 128M, can you rest easy?

Not necessarily. For example, a 127M file would normally be handled by a single map task, but this file may have only one or two small fields and tens of millions of records.

If the map processing logic is complex, completing it with a single map task will certainly be slow.

Problems 3 and 4 above call for two opposite remedies: reducing the number of map tasks, and increasing the number of map tasks.

How to merge small files and reduce the number of map tasks?

Suppose a SQL task:

select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';

The input directory of the task is /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04,

which contains 194 files, many of them much smaller than 128M, with a total size of 9G. Normal execution would use 194 map tasks.

Total computing resources consumed by the map stage: SLOTS_MILLIS_MAPS=623020.

I reduced the number of map tasks by merging small files before map execution:

set mapred.max.split.size=100000000;

set mapred.min.split.size.per.node=100000000;

set mapred.min.split.size.per.rack=100000000;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Executing the same statement then used 74 map tasks, and the map stage consumed SLOTS_MILLIS_MAPS=333500.

For such a simple SQL task, the execution time is about the same, but half the computing resources are saved.

Roughly speaking, 100000000 means 100M, and the parameter set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files before execution.

The first three parameters determine the size of the merged splits: data larger than 128M is cut at the 128M block boundary; pieces between 100M and 128M are cut at 100M; and everything smaller than 100M (remaining small files together with the leftovers of the split large files) is merged together, ultimately producing 74 splits.
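The packing behavior can be illustrated with a toy greedy model. This is a deliberate simplification: it cuts large files at the max split size rather than the 128M block boundary, and the real CombineHiveInputFormat is also node- and rack-aware (hence the per-node and per-rack parameters above), which this sketch ignores:

```python
def combine_splits(file_sizes_mb, max_split_mb=100):
    """Toy model of combined splits: each file larger than max_split_mb
    is cut into max_split_mb pieces; the remainders and the small files
    are then packed greedily into splits of at most max_split_mb."""
    splits = 0
    remainders = []
    for size in file_sizes_mb:
        full, rest = divmod(size, max_split_mb)
        splits += full              # big files cut at the max split size
        if rest:
            remainders.append(rest)
    current = 0
    for r in remainders:            # pack leftovers and small files together
        if current + r > max_split_mb:
            splits += 1
            current = 0
        current += r
    if current:
        splits += 1
    return splits

# One 250M file plus small files of 30M, 40M, 60M
print(combine_splits([250, 30, 40, 60]))
```

With these hypothetical sizes the 250M file yields two full 100M splits plus a 50M leftover, and the leftovers pack into two more combined splits, for four map tasks instead of the original four-file count of five blocks.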

How to increase the number of map tasks appropriately?

When the input files are very large, the task logic is complex, and map execution is slow, consider increasing the number of map tasks to reduce the amount of data each one processes, thereby improving task efficiency.

Suppose there is a task:

select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a
group by data_desc;

If table a has only one file of size 120M but it contains tens of millions of records, completing this task with a single map will certainly be slow. In this case we should consider splitting the file into several smaller files,

so that multiple map tasks can be used:

set mapred.reduce.tasks=10;

create table a_1 as
select * from a
distribute by rand(123);

This randomly distributes the records of table a into table a_1, which consists of 10 files. Replacing table a in the SQL above with a_1 then uses 10 map tasks.

Each map task, now processing roughly 12M of data (millions of records), will certainly be much more efficient.
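The effect of distribute by rand(123) can be sketched with a toy model in which each record is assigned to one of 10 reducers by a seeded pseudo-random draw, and each reducer writes one output file (the hash-on-rand mechanics of a real Hive shuffle differ, but the balancing effect is the same):

```python
import random

def distribute(records, num_reducers=10, seed=123):
    """Toy model of DISTRIBUTE BY rand(123): each record goes to a
    pseudo-random reducer, so the output lands in num_reducers files
    of roughly equal size."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_reducers)]
    for rec in records:
        buckets[rng.randrange(num_reducers)].append(rec)
    return buckets

# Ten million records in one file -> ten roughly equal buckets
buckets = distribute(range(1_000_000))
print(len(buckets), min(len(b) for b in buckets), max(len(b) for b in buckets))
```

Because the assignment is uniform, each of the 10 buckets receives close to a tenth of the records, which is exactly why the 10 resulting files feed 10 similarly sized map tasks downstream.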

These two approaches may seem contradictory: one merges small files, while the other splits a large file into smaller ones. That tension is exactly the point to watch.

Control the number of map tasks according to the actual situation, following two principles: give a large data volume an appropriate number of map tasks, and give each individual map task an appropriate amount of data.

Thank you for reading! That concludes this article on how to control the number of map tasks in Hive. I hope it has been helpful; if you found it useful, please share it so more people can see it.
