How to Control the Number of Reduce Tasks in Hive

This article explains in detail how Hive determines the number of reduce tasks for a job and how to control it. It should be a useful reference for anyone tuning Hive queries.
1. How Hive determines the number of reduce tasks:
The number of reduce tasks strongly affects how efficiently a job runs. If it is not specified, Hive estimates it based on the following two settings:
hive.exec.reducers.bytes.per.reducer (the amount of data processed per reduce task; the default is 1000^3 bytes, i.e. 1 GB)
hive.exec.reducers.max (the maximum number of reduce tasks per job; the default is 999)
The formula for the number of reducers is simple: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)
In other words, if the total input to the reduce stage (the output of the map stage) does not exceed 1 GB, there will be only one reduce task.
For example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
The directory /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 totals a little over 9 GB, so this statement runs with 10 reduce tasks.
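As a quick check (a sketch; the values shown are the usual defaults, not guaranteed for your installation), the Hive CLI prints a setting's current value when you run set with the parameter name alone, so you can verify the inputs to the formula:
set hive.exec.reducers.bytes.per.reducer; -- prints the current value, e.g. 1000000000 (1 GB)
set hive.exec.reducers.max; -- prints the current value, e.g. 999
-- for the query above: N = min(999, ceil(9 GB / 1 GB)) = 10 reduce tasks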
2. Method 1 for adjusting the number of reduce tasks:
Adjust the value of the hive.exec.reducers.bytes.per.reducer parameter:
set hive.exec.reducers.bytes.per.reducer=500000000; -- 500 MB
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time the query runs with 20 reduce tasks.
3. Method 2 for adjusting the number of reduce tasks:
set mapred.reduce.tasks=15;
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time the query runs with 15 reduce tasks.
4. More reduce tasks are not always better.
As with map tasks, starting and initializing reduce tasks costs time and resources.
In addition, a job produces as many output files as it has reduce tasks. If it generates many small files and those files feed the next job, that job will suffer from the too-many-small-files problem.
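As a mitigation sketch (these are standard Hive merge parameters; the values shown are illustrative, not prescriptive), Hive can merge small output files at the end of a job:
set hive.merge.mapredfiles=true; -- merge small files produced by a MapReduce job
set hive.merge.size.per.task=256000000; -- target size for merged files, about 256 MB
set hive.merge.smallfiles.avgsize=16000000; -- merge when the average output file is smaller than this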
5. When is there only one reduce task?
You will often find that a job runs with a single reduce task no matter how large the input is and no matter how you set the parameters above.
Besides the case where the input is smaller than hive.exec.reducers.bytes.per.reducer, a single reduce task also results from the following:
A) An aggregation without a group by, for example writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
This is very common, and such queries should be rewritten with a group by whenever possible.
B) order by is used.
C) The query contains a Cartesian product.
In these cases there is usually no remedy other than restructuring the query to avoid them, because these operations are global and Hadoop has to perform them in a single reduce task.
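For the order by case specifically, a common workaround (a sketch with illustrative table and column names, usable when a total order is not strictly required) is to replace the global order by with distribute by plus sort by, so that many reducers each sort their own partition:
-- one reduce task, globally sorted output:
select userid, cnt from user_counts order by cnt desc;
-- many reduce tasks, output sorted within each reducer only:
select userid, cnt from user_counts distribute by userid sort by cnt desc;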
Likewise, setting the number of reduce tasks means balancing two principles: give a large data volume an appropriate number of reduce tasks, and give each reduce task an appropriate amount of data to process.
Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can run on Hadoop.
SQL for Hive should be designed around the characteristics of distributed computing, which differ from those of a traditional relational database.
We therefore need to let go of some habits formed when developing against relational databases.
Basic principles:
1: Filter data as early as possible to reduce the data volume at every stage: use partition pruning on partitioned tables, and select only the columns you actually need.
select ... from A
join B
on A.key = B.key
where A.userid > 10
and B.userid < 10
and A.dt = '201200417'
and B.dt = '201200417';
should be rewritten as:
select ... from (select ... from A
where dt = '201200417'
and userid > 10
) a
join (select ... from B
where dt = '201200417'
and userid < 10
) b
on a.key = b.key;
2: Make operations as atomic as possible, and avoid packing complex logic into a single SQL statement.
Intermediate tables can be used to carry the complex logic step by step:
drop table if exists tmp_table_1;
create table if not exists tmp_table_1 as
select ...;
drop table if exists tmp_table_2;
create table if not exists tmp_table_2 as
select ...;
drop table if exists result_table;
create table if not exists result_table as
select ...;
drop table if exists tmp_table_1;
drop table if exists tmp_table_2;
3: Keep the number of jobs generated by a single SQL statement below 5 where possible.
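To count the jobs before running a statement (a quick sketch using Hive's standard explain command), inspect the stage plan:
explain
select pt, count(1) from popt_tbaccountcopy_mes
where pt = '2012-07-04'
group by pt;
-- the output lists the stages Hive will execute, from which you can count the jobs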
4: Use mapjoin cautiously. As a rule of thumb, only tables with fewer than 2000 rows and smaller than 1 MB should be used (this can be relaxed somewhat after the cluster is expanded), and the small table should be placed on the left side of the join (currently many queries in TCL place small tables on the right side of the join).
Otherwise it causes heavy disk and memory consumption.
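A minimal sketch of the hint form (the table names are illustrative; the MAPJOIN hint and the auto-convert settings are standard Hive features): the hint loads the small table into memory and performs the join on the map side, skipping the reduce phase.
select /*+ MAPJOIN(b) */ a.key, a.value
from big_table a
join small_dim b on a.key = b.key;
-- newer Hive versions can convert small-table joins automatically:
-- set hive.auto.convert.join=true;
-- set hive.mapjoin.smalltable.filesize=25000000; -- small-table threshold, about 25 MB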
5: Before writing SQL, understand the characteristics of the data itself. If the query contains join or group operations, watch out for data skew.
If data skew occurs, apply the following settings:
set hive.exec.reducers.max=200;
set mapred.reduce.tasks=200; -- increase the number of reduce tasks
set hive.groupby.mapaggr.checkinterval=100000; -- if the number of records for a group-by key exceeds this value, the work is split; set it according to the actual data volume
set hive.groupby.skewindata=true; -- set to true if the skew occurs during a group by
set hive.skewjoin.key=100000; -- if the number of records for a join key exceeds this value, the work is split; set it according to the actual data volume
set hive.optimize.skewjoin=true; -- set to true if the skew occurs during a join
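As a sketch of what hive.groupby.skewindata does when enabled: Hive runs two MapReduce jobs, the first distributing rows randomly across reducers for partial aggregation and the second merging the partial results by key:
set hive.groupby.skewindata=true;
select pt, count(1)
from popt_tbaccountcopy_mes
where pt = '2012-07-04'
group by pt;
-- job 1: random distribution and partial aggregation; job 2: final merge by key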
6: If a union all has more than 2 branches, or each branch carries a large amount of data, split it into multiple insert into statements; in our testing this improved execution time by about 50%.
insert overwrite table tablename partition (dt=...)
select ... from (
select ... from A
union all
select ... from B
union all
select ... from C
) R
where ...;
It can be rewritten as:
insert into table tablename partition (dt=...)
select ... from A
where ...;
insert into table tablename partition (dt=...)
select ... from B
where ...;
insert into table tablename partition (dt=...)
select ... from C
where ...;
That covers how to control the number of reduce tasks in Hive. Thanks for reading, and I hope it helps!