How to control the number of map and reduce tasks in a Hive job (Hive optimization)


This article introduces how to control the number of map and reduce tasks in a Hive job as part of Hive optimization. It has some reference value, and interested readers are welcome to follow along; I hope you learn a lot from it.

1. Controlling the number of map tasks in a Hive job:

1. Typically, jobs generate one or more map tasks from the input directory.

The main determining factors are the total number of input files, the size of the input files, and the file block size configured for the cluster (currently 128 MB; it can be viewed in Hive with the set dfs.block.size; command, but it cannot be customized or modified from there).

2. For example:

a) Assume the input directory contains a file a of size 780 MB. Hadoop will split file a into 7 blocks (six 128 MB blocks and one 12 MB block), producing 7 map tasks.

b) Assume the input directory contains three files of 10 MB, 20 MB, and 130 MB. Hadoop will split them into 4 blocks (10 MB, 20 MB, 128 MB, and 2 MB), producing 4 map tasks.

In other words, a file larger than the block size (128 MB) is split, while a file smaller than the block size is treated as a single block.
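The split arithmetic for the two examples above can be checked by hand. A minimal sketch from the Hive CLI, assuming the default 128 MB block size mentioned earlier (128 MB = 134217728 bytes; the value on a given cluster may differ):

set dfs.block.size;
-- example a): 780 MB = 6 x 128 MB + 12 MB -> 7 splits -> 7 map tasks
-- example b): 10 MB, 20 MB, 130 MB -> splits of 10 MB, 20 MB, 128 MB, 2 MB -> 4 splits -> 4 map tasks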

3. Is it true that the more map tasks, the better?

The answer is no. If a task has many small files (far smaller than the 128 MB block size), each small file is treated as a block and handled by its own map task, and starting and initializing a map task takes far longer than the actual processing, which wastes a great deal of resources. Moreover, the number of map tasks that can run simultaneously is limited.

4. Is it guaranteed that as long as each map task processes a file block close to 128 MB, there is nothing to worry about?

Not necessarily. For example, consider a 127 MB file that would normally be handled by a single map task, but it has only one or two small fields and tens of millions of records. If the map processing logic is at all complex, doing all of this with a single map task is bound to be slow.

To solve problems 3 and 4 above, we need two approaches: reducing the number of map tasks and increasing the number of map tasks.

How do we merge small files to reduce the number of map tasks?

Suppose a SQL task:

select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';

The task's input directory is /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04.

It contains 194 files, many of which are far smaller than 128 MB, with a total size of about 9 GB; a normal run uses 194 map tasks.

Total computing resources consumed by the map stage: SLOTS_MILLIS_MAPS=623020

I reduced the number of map tasks by merging the small files before map execution, using the following settings:

set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

Then I executed the same statement; it used 74 map tasks, and the computing resources consumed by the map stage dropped to SLOTS_MILLIS_MAPS=333500.

For a simple SQL task like this, the execution time ends up about the same, but roughly half of the computing resources are saved.

Roughly speaking, 100000000 means 100 MB, and set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to combine small files into larger splits before execution.

The first three parameters determine the size of the merged splits: anything larger than 128 MB is split at 128 MB boundaries, pieces smaller than 128 MB but larger than 100 MB are split at 100 MB, and whatever is left under 100 MB (including genuinely small files and the leftover fragments of split files) is merged together. In the end this produced 74 splits.
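Putting the pieces together, the whole session for this example might look like the following sketch (the parameter values are simply the ones used above, not general recommendations):

set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- the same query as before now runs with 74 map tasks instead of 194
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';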

How do we increase the number of map tasks appropriately?

When the input files are large, the task logic is complex, and map execution is very slow, we can consider increasing the number of map tasks so that each map task processes less data, which improves task execution efficiency.

Suppose there is a task:

select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a
group by data_desc;

If table a has only one file, 120 MB in size, but containing tens of millions of records, completing this task with a single map task is bound to be slow. In this case, we should consider splitting the file into several files so that multiple map tasks can be used:

set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);

This way, the records of table a are randomly and evenly distributed into table a_1, which consists of 10 files. Replacing table a with a_1 in the SQL above then results in the query using 10 map tasks.

Each map task then handles about 12 MB of data (several million records), and efficiency is certainly much better.
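As a sketch of this second step, the original aggregation simply points at a_1 instead of a; the ellipses stand for the original case/sum expressions, which this article does not spell out:

select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a_1
group by data_desc;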

These two approaches may look contradictory: one merges small files while the other splits a large file into smaller ones, and this is exactly the point that needs attention.

Depending on the actual situation, controlling the number of map tasks should follow two principles: let a large volume of data use an appropriate number of map tasks, and let each map task process an appropriate amount of data.

2. Controlling the number of reduce tasks in a Hive job:

1. How Hive determines the number of reduce tasks:

The number of reduce tasks greatly affects task execution efficiency. If it is not specified, Hive estimates the number of reduce tasks based on the following two settings:

hive.exec.reducers.bytes.per.reducer (the amount of data processed by each reduce task; the default is 1000^3 = 1 GB)

hive.exec.reducers.max (the maximum number of reduce tasks per job; the default is 999)

The formula for computing the number of reducers is simple: N = min(parameter 2, total input data size / parameter 1).

In other words, if the total input to the reduce stage (the output of the map stage) does not exceed 1 GB, there will be only one reduce task.

For example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;

/group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 has a total size of a little more than 9 GB, so this statement gets 10 reduce tasks.
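Plugging the numbers from this example into the formula (Hive rounds the division up; I am reading the "a little more than 9 GB gives 10 reducers" figure that way):

N = min(hive.exec.reducers.max, ceil(total input size / hive.exec.reducers.bytes.per.reducer))
  = min(999, ceil(9.x GB / 1 GB))
  = min(999, 10)
  = 10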

2. Method 1 for adjusting the number of reduce tasks:

Adjust the value of the hive.exec.reducers.bytes.per.reducer parameter

set hive.exec.reducers.bytes.per.reducer=500000000; (500 MB)

select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 20 reduce tasks.
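Working the formula backwards as a rough check (an inference from the numbers above, not a figure from the original job):

N = min(999, ceil(total input / 500 MB)) = 20  =>  9.5 GB < total input <= 10 GB

which is consistent with the total of a little more than 9 GB mentioned earlier.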

3. Method 2 for adjusting the number of reduce tasks:

set mapred.reduce.tasks=15;

select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 15 reduce tasks.

4. More reduce tasks is not always better.

As with map tasks, starting and initializing reduce tasks also consumes time and resources.

In addition, there are as many output files as there are reduce tasks. If many small files are generated and they become the input of the next task, the problem of too many small files appears again.

5. When is there only one reduce task?

In many cases you will find that, no matter how large the task's data volume is and no matter how you adjust the reduce-related parameters, the task always has only one reduce task. Besides the case where the data volume is smaller than the hive.exec.reducers.bytes.per.reducer value, the reasons are:

a) The aggregation has no group by, for example writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';

This is very common, and I hope you rewrite such queries whenever possible (see the sketch after this list).

b) Order by is used.

c) The query contains a Cartesian product.

In these cases, apart from finding ways to work around or avoid them, I do not have a good solution for now, because these operations are all global, so Hadoop has to use a single reduce task to complete them.
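For case a), here is a hedged sketch of the rewrite mentioned above; the outer sum is my own addition, shown only to illustrate that the per-partition counts can still be rolled up into a single total afterwards:

-- single-reduce form:
--   select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
-- rewritten to keep the group by, rolling the per-pt counts up in an outer query:
select sum(cnt)
from (
  select pt, count(1) as cnt
  from popt_tbaccountcopy_mes
  where pt = '2012-07-04'
  group by pt
) t;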

Likewise, these two principles apply when setting the number of reduce tasks: let a large volume of data use an appropriate number of reduce tasks, and let each reduce task process an appropriate amount of data.

To be studied:

The number of map tasks is usually determined by the DFS block size of the Hadoop cluster, that is, by the total number of blocks in the input files. The normal degree of map parallelism is roughly 10 to 100 map tasks per node; for jobs that consume little CPU, the number of map tasks can be set to around 300. However, since every Hadoop task needs a certain amount of time to initialize, it is reasonable for each map task to run for at least one minute.

Data splitting works as follows. By default, the InputFormat creates splits according to the cluster's DFS block size, and each split is processed by one map task. Users can customize this in the job submission client through the mapred.min.split.size parameter. Another important parameter is mapred.map.tasks, which sets the number of map tasks only as a hint; it takes effect only when the number of map tasks determined by the InputFormat is smaller than the mapred.map.tasks value. Similarly, the number of map tasks can be set manually with JobConf's conf.setNumMapTasks(int num) method; this can increase the number of map tasks, but cannot set it below the number of splits Hadoop derives from the input data. Of course, to improve the cluster's concurrency, a relatively large default number of map tasks can be configured, so that when a user's map count is small or below the automatically computed split count, the larger default kicks in and the overall efficiency of the Hadoop cluster improves.
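From the Hive CLI, the two split-related knobs mentioned above can be set per session. A minimal sketch, with placeholder values rather than recommendations (and remember that mapred.map.tasks is only a hint, as explained above):

-- ask for splits of at least roughly 256 MB (268435456 bytes; an illustrative value)
set mapred.min.split.size=268435456;
-- suggest a target number of map tasks; only honored when it exceeds the
-- number of splits the InputFormat would produce on its own
set mapred.map.tasks=300;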

Thank you for reading this article carefully. I hope the article "How to control the number of map and reduce tasks in a Hive job (Hive optimization)" shared by the editor is helpful to everyone. I also hope you will continue to support us; more related knowledge is waiting for you to learn!
